AI Engineering

Harness engineering is the missing discipline around coding agents

OpenAI and Anthropic are converging on the same lesson: coding agents do not become reliable because the prompt is clever. They become reliable when the repo, tools, evals, traces, browser checks, and review loops make their work legible.

Apr 30, 2026 · 13 min read

The prompt is not the product.

That sentence is useful, but it can also hide too much. So let us define the thing properly.

A prompt is the instruction packet a model receives for a run: the task, the surrounding context, constraints, examples, system instructions, and conversation state available at that moment. In a coding agent, the prompt may also include tool descriptions, repository instructions, previous messages, and the current plan.
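As a rough sketch only, that packet could be modeled as a small structure; the field names below are illustrative, not any particular vendor's API:

```python
from dataclasses import dataclass, field

@dataclass
class PromptPacket:
    """Hypothetical shape of what one agent run actually receives."""
    system_instructions: str                        # stable rules and persona
    task: str                                       # what this specific run should accomplish
    repo_instructions: str = ""                     # e.g. the contents of AGENTS.md
    tool_descriptions: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    conversation_state: list[dict] = field(default_factory=list)
    current_plan: str = ""                          # plan carried over from earlier steps
```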

Prompts matter. Bad prompts produce bad work. But a prompt is still only an input to a loop.

A harness is the loop.

It is the runtime, repository structure, tool interface, sandbox, documentation system, test suite, browser access, observability, evaluation criteria, review process, and human escalation policy around the model. It is the engineered environment that lets an agent make progress without guessing what reality looks like.
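As a minimal sketch of that loop, where every name is a stand-in for whatever runtime, tool layer, and checks a team actually uses:

```python
def run_agent(task, model, tools, checks, max_steps=50):
    """Sketch of a harness loop: the model proposes, the harness executes and verifies.
    `model`, `tools`, and `checks` are placeholders, not a real framework API."""
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        action = model.next_action(state)            # the prompt is assembled here, from state
        if action.kind == "finish":
            break
        result = tools[action.name](**action.args)   # executed inside the sandbox
        state["history"].append((action, result))    # observations feed the next step
    evidence = [check(state) for check in checks]    # "done" is decided by checks, not by the model
    return state, evidence
```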

That is why harness engineering is becoming more important than prompt engineering for serious coding-agent work.

Why the term matters now

OpenAI's article on harness engineering with Codex is not interesting because it says Codex can write code. Everyone has seen models write code.

It is interesting because OpenAI describes an internal product where Codex wrote not only product code, but also tests, CI, docs, observability, and internal tools. The human role shifted upward: engineers designed environments, specified intent, and built feedback loops so agents could execute reliably.

That is the real claim.

Not "the prompt got better."

The operating system around the agent got better.

OpenAI describes making the application itself legible to Codex: local worktrees that can boot independently, browser control through Chrome DevTools, DOM snapshots, screenshots, navigation, logs, metrics, and traces exposed in local development. Once the agent can see the running product and the telemetry, tasks like startup latency or slow spans become tractable instead of aspirational.

They also describe a repository knowledge system where AGENTS.md is not a giant manual. It is a table of contents. The deeper truth lives in structured docs, architecture notes, design docs, product specs, execution plans, generated schema docs, quality documents, and references that are versioned with the repo.

That detail matters. A large instruction file feels like control, but it usually becomes stale context. A harness uses progressive disclosure: start the agent with a small stable map, then let it discover the relevant source of truth.

Anthropic's work points in the same direction. In Effective harnesses for long-running agents, the central problem is not that Claude lacks intelligence. The problem is continuity. Long tasks span multiple context windows. Each new session needs a way to know what happened before, what remains unfinished, how to run the app, and what counts as done.

Their solution is concrete: an initializer agent creates a feature list, the coding agent works one feature at a time, progress is committed to git, notes are written for the next session, and end-to-end browser testing is made explicit. The model is not asked to "be careful." The harness changes the shape of the work so careful behavior is easier.

In Harness design for long-running application development, Anthropic goes further. They use planner, generator, and evaluator roles, sprint contracts, visual criteria, and browser-driven evaluation to turn subjective work like frontend quality into something the system can actually grade. The important part is not the multi-agent architecture by itself. The important part is that taste, usability, originality, and verification are encoded into a reviewable process.

This is the pattern across serious agent work:

  • The model is not enough.
  • The prompt is not enough.
  • The environment must become legible.
  • The definition of done must become testable.
  • The handoff between humans, tools, and agents must become explicit.

That is harness engineering.

Prompt engineering versus harness engineering

Prompt engineering asks:

  • What should I tell the model?
  • What examples should I provide?
  • What tone, format, and constraints should be in the instruction?
  • How do I reduce ambiguity in this run?

Harness engineering asks:

  • What should the agent be able to see?
  • What tools should it be able to use?
  • What state survives between runs?
  • What evidence proves the work is correct?
  • What boundaries should be mechanically enforced?
  • What must remain a human decision?
  • What did this failure teach the system for next time?

The prompt is one layer of the harness. It is not the harness.

If an agent cannot find the right file, run the right tests, start the app, inspect the browser, query logs, understand the architecture, or recover from a broken intermediate state, a better prompt only gives you a more articulate failure.

The anatomy of a useful coding-agent harness

A coding-agent harness has several layers. You can build them incrementally, but skipping them is why so many agent demos feel impressive and then fall apart inside real repositories.

1. A repository map, not a sacred manual

Agents need orientation. So do humans.

The weak version is a huge instruction file full of rules, preferences, and outdated warnings. It feels thorough, but it eats context and decays quickly.

The stronger version is a compact entry point:

  • where the app entry points live
  • where product specs live
  • where architecture rules live
  • how to run checks
  • what directories are dangerous
  • where examples of good changes live
  • where known failure modes are documented

The repo then holds the deeper knowledge as versioned artifacts: architecture docs, execution plans, schemas, design rules, eval definitions, quality notes, and migration history.

This is context engineering at repository scale. The goal is not to put everything into context. The goal is to make the right thing discoverable at the right time.
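One cheap way to keep that entry point honest is a check that every path it references still exists, so the map cannot silently rot. A sketch, assuming a markdown-style AGENTS.md that links into docs/, specs/, and plans/:

```python
# Sketch: keep the compact entry point honest by verifying that every path it
# references still exists. The paths and the AGENTS.md link format are assumptions.
import re
import sys
from pathlib import Path

REPO = Path(__file__).resolve().parent
ENTRY_POINT = REPO / "AGENTS.md"

def referenced_paths(text: str) -> list[str]:
    # Assume the map links deeper docs with markdown-style paths like (docs/architecture.md)
    return re.findall(r"\(((?:docs|specs|plans)/[\w./-]+)\)", text)

def main() -> int:
    missing = [p for p in referenced_paths(ENTRY_POINT.read_text()) if not (REPO / p).exists()]
    for p in missing:
        print(f"AGENTS.md points at a file that no longer exists: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
```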

2. Tools designed for agents, not just humans

Anthropic's Building effective agents makes a point that deserves more attention: tool definitions deserve as much care as prompts.

A command that is obvious to an experienced engineer can be awkward for a model. A relative filepath can become a source of failure after the agent changes directories. A JSON-escaped code field can make a simple edit more error-prone. Two tools with overlapping names can cause tool-selection failures.

This is agent-computer interface design.

For coding agents, that means:

  • commands have clear names and predictable output
  • test scripts are documented and fast enough to run repeatedly
  • file-editing tools avoid formats that are hard for the model to produce
  • local dev startup has one obvious command
  • browser checks are first-class for UI work
  • logs and traces are queryable without leaving the task
  • dangerous tools are separated from reversible read-only tools

The tool layer should reduce cognitive load for the model in the same way good internal tooling reduces cognitive load for engineers.
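A sketch of what that looks like for one tool, using real pytest and subprocess calls but an invented function name and return shape: absolute paths, bounded output, and an explicit error instead of a stack trace.

```python
import subprocess
from pathlib import Path

def run_tests(repo_root: str, test_path: str, timeout_s: int = 120) -> dict:
    """Agent-facing test runner sketch: small, predictable output the model can read."""
    target = (Path(repo_root) / test_path).resolve()   # never depend on the agent's cwd
    if not target.is_file():
        return {"ok": False, "error": f"no such test file: {target}"}
    proc = subprocess.run(
        ["pytest", str(target), "-q", "--maxfail=5"],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "output": proc.stdout[-4000:],                 # bounded tail, not an unbounded dump
    }
```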

3. A sandbox that can be trusted

An agent that can edit code but cannot safely run the app is half blind.

The harness needs an execution environment where the agent can do real work without touching production by accident. That usually means isolated worktrees, local services, throwaway databases or fixtures, seeded test data, environment variables with safe defaults, and permission boundaries around destructive actions.

OpenAI's Codex example is useful here because they made each change bootable and inspectable in isolation. Anthropic's later work on Managed Agents separates the system into three stable concepts: session, harness, and sandbox. That framing is clean. The session is the append-only record. The harness is the loop routing model and tool interaction. The sandbox is where the agent can execute.

Those are different concerns. Coupling them too tightly makes long-running work hard to debug, recover, and scale.
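A sketch of the sandbox piece alone: one isolated git worktree per task, with throwaway defaults for anything that could touch shared state. The branch naming, environment variables, and layout are assumptions about a hypothetical repo.

```python
# Sketch: one isolated worktree per task, with safe throwaway defaults.
import os
import subprocess
import tempfile

def create_sandbox(repo_root: str, task_id: str) -> dict:
    workdir = tempfile.mkdtemp(prefix=f"agent-{task_id}-")
    subprocess.run(
        ["git", "-C", repo_root, "worktree", "add", "-b", f"agent/{task_id}", workdir],
        check=True,
    )
    env = {
        **os.environ,
        "DATABASE_URL": "sqlite:///:memory:",   # never a shared or production database
        "APP_ENV": "sandbox",
        "EXTERNAL_CALLS": "disabled",           # destructive side effects stay off by default
    }
    return {"workdir": workdir, "env": env}
```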

4. State that survives the context window

Long tasks do not fail only because the model is wrong. They fail because the next run does not know what the previous run learned.

Anthropic's long-running harness uses feature lists, progress files, git history, and startup checks so each new session can recover orientation. This is boring in exactly the right way.

The minimum viable version is:

  • a task list with explicit pass/fail status
  • a progress file that records what changed and what remains
  • commits with useful messages
  • a startup script that proves the app still boots
  • a rule that the next agent starts by reading the state before editing

This is not paperwork. It is memory.

If the only durable artifact is the final diff, the harness is forcing future agents to rediscover the reasoning.
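The durable-state layer can be embarrassingly small. A sketch, where the file name and fields are assumptions rather than any standard:

```python
# Sketch: durable state the next session reads before touching code.
import json
from datetime import datetime, timezone
from pathlib import Path

PROGRESS = Path("PROGRESS.json")

def record_progress(done: list[str], remaining: list[str], notes: str) -> None:
    PROGRESS.write_text(json.dumps({
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "done": done,                  # features with passing checks, not "probably fine"
        "remaining": remaining,
        "notes": notes,                # what the next session must know before editing
    }, indent=2))

def load_progress() -> dict:
    # The rule: read this before editing anything.
    return json.loads(PROGRESS.read_text()) if PROGRESS.exists() else {"done": [], "remaining": []}
```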

5. Checks that make "done" hard to fake

Agents are prone to premature completion. Humans are too, but agents can do it faster.

The harness should make completion expensive to fake and cheap to verify.

For a backend change, that might mean unit tests, integration tests, contract tests, database migration checks, and trace assertions.

For a frontend change, it must include the running product: browser snapshots, responsive viewports, console errors, network failures, screenshots, and user-level flows. A component can type-check and still be visually broken. A page can render and still hide the primary action below the fold.
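A browser-level check along those lines could be a short Playwright script; the URL, selector, and artifact paths below are placeholders:

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

def check_page(url: str, out_dir: str = "artifacts") -> dict:
    """Sketch: screenshots at two viewports, console errors, and one user-level assertion."""
    Path(out_dir).mkdir(exist_ok=True)
    console_errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("console", lambda msg: console_errors.append(msg.text) if msg.type == "error" else None)
        page.goto(url)
        page.set_viewport_size({"width": 375, "height": 812})       # phone-sized viewport
        page.screenshot(path=f"{out_dir}/mobile.png", full_page=True)
        page.set_viewport_size({"width": 1440, "height": 900})
        page.screenshot(path=f"{out_dir}/desktop.png", full_page=True)
        # "primary-action" is a hypothetical test id for the page's main call to action
        primary_action_visible = page.locator("[data-testid='primary-action']").is_visible()
        browser.close()
    return {"console_errors": console_errors, "primary_action_visible": primary_action_visible}
```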

OpenAI's agent evals documentation and trace grading work are relevant because agent quality often lives in the path, not only the final answer. Did the agent choose the right tool? Did it escalate at the right moment? Did it preserve policy constraints? Did it silently skip a required step?

Output checks are necessary. Trace checks catch a different class of failures.
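A trace check might look like this sketch; the trace format, a list of step dictionaries, is an assumption, and the point is only that it inspects the path taken rather than the final diff:

```python
# Sketch: trace-level assertions over the path the agent took.
def required_steps_happened(trace: list[dict]) -> list[str]:
    failures = []
    tools_used = [step["tool"] for step in trace if step.get("type") == "tool_call"]
    if "run_tests" not in tools_used:
        failures.append("agent never ran the test suite")
    if any(step.get("touched_ui") for step in trace) and "browser_check" not in tools_used:
        failures.append("UI changed but no browser-level verification ran")
    if trace and trace[-1].get("risk") == "high" and trace[-1].get("type") != "human_review":
        failures.append("high-risk final step was not escalated for review")
    return failures
```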

6. Mechanical boundaries for architecture and taste

Documentation alone will not keep an agent-generated codebase coherent.

OpenAI's article is blunt on this point: architecture and taste need enforcement. They describe custom linters, structural tests, file-size limits, schema naming conventions, structured logging, dependency direction rules, and remediation instructions embedded in lint errors.

This is the part many teams miss.

If a team says "we prefer clean architecture" but the repo allows any file to import any other file, the harness is lying. If a team says "we care about accessibility" but no check catches missing labels, the harness is relying on mood. If a team says "we avoid generic UI" but the evaluator has no criteria for originality, the agent will drift toward the average.

Taste can be partially encoded. Not perfectly, but enough to improve the baseline:

  • naming rules
  • import boundaries
  • design tokens
  • contrast checks
  • screenshot review
  • layout invariants
  • examples of accepted and rejected work
  • evaluator rubrics with calibrated examples

Human judgment still matters. The point is to spend human judgment on the hard calls, not on the same preventable defects every day.
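One concrete boundary, sketched with Python's standard ast module and layer names that are assumptions about a hypothetical codebase: a dependency-direction rule whose failure message carries the remediation.

```python
# Sketch: a dependency-direction rule as a failing check, with the fix in the message.
import ast
import sys
from pathlib import Path

FORBIDDEN = {"domain": ["infrastructure", "api"]}   # domain must not import outward

def violations(root: str = "src") -> list[str]:
    found = []
    for path in Path(root).rglob("*.py"):
        layer = path.relative_to(root).parts[0]
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                names = [a.name for a in node.names] if isinstance(node, ast.Import) else [node.module or ""]
                for name in names:
                    if any(name.startswith(bad) for bad in FORBIDDEN.get(layer, [])):
                        found.append(
                            f"{path}: '{layer}' imports '{name}'. "
                            "Fix: depend on an interface in domain/ and implement it in infrastructure/."
                        )
    return found

if __name__ == "__main__":
    problems = violations()
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```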

7. Review loops with escalation

OpenAI's agent guide describes agents as systems that use tools within guardrails and can hand control back to a user when needed. That escalation ability is not optional in production systems.

A coding harness should define what the agent can do alone, what requires review, and what must stop for a human.

Examples:

  • changing authentication boundaries
  • loosening permissions
  • deleting production data
  • changing pricing or billing logic
  • sending external messages
  • accepting a large dependency
  • rewriting architecture
  • altering legal, security, or compliance text

The right harness does not make the agent timid everywhere. It makes the agent aggressive in reversible work and conservative at trust boundaries.
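That boundary can be encoded as data the harness enforces rather than a plea in the prompt. A sketch, with tiers and action names that are purely illustrative:

```python
# Sketch: escalation policy as data the harness can enforce.
AUTONOMOUS = {"edit_file", "run_tests", "run_linter", "take_screenshot"}
NEEDS_REVIEW = {"add_dependency", "change_schema", "rewrite_module"}
HUMAN_ONLY = {"change_auth_boundary", "loosen_permissions", "delete_production_data",
              "change_billing_logic", "send_external_message"}

def gate(action: str) -> str:
    if action in HUMAN_ONLY:
        return "stop_and_escalate"       # hand control back to a person
    if action in NEEDS_REVIEW:
        return "queue_for_review"        # proceed, but a reviewer must approve the diff
    if action in AUTONOMOUS:
        return "proceed"                 # reversible work: be aggressive
    return "stop_and_escalate"           # unknown actions default to the conservative path
```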

A practical harness for a real repo

If I were making a codebase agent-ready, I would not start with a complex swarm. I would start with the smallest harness that makes one strong agent reliable.

| Layer | Artifact | What it prevents |
| --- | --- | --- |
| Orientation | Short AGENTS.md plus linked docs | The agent starts in the wrong place |
| State | Task list, progress notes, git history | Each session rediscovers the work |
| Execution | dev, test, build, check, seed scripts | The agent guesses how to run the app |
| Product vision | UX, copy, and design criteria | Technically valid but generic output |
| Verification | Unit, integration, browser, screenshot, trace checks | The agent declares victory too early |
| Boundaries | Linters, schemas, permissions, guardrails | Speed creates architectural drift |
| Review | Specialist reviews and human escalation rules | Risky changes merge by momentum |
| Memory | Failure notes and recurring cleanup | The same mistake repeats forever |

The order matters. If a single agent cannot reliably complete a well-scoped task with this harness, adding more agents usually multiplies coordination problems.

Multi-agent systems can be valuable. OpenAI describes manager and handoff patterns. Anthropic's application harness uses planner, generator, and evaluator roles. But the win comes from better contracts, not from more actors.

If agents split work without shared context, shared state, and explicit contracts, they create integration debt. If they share everything, they can overload context. Harness engineering is the discipline of deciding what each part sees, what each part owns, and how the next step verifies the previous one.
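An explicit contract does not need to be elaborate. A sketch of a handoff payload that one step fills in and the next step verifies against, with illustrative field names:

```python
# Sketch: an explicit contract between steps. Whoever produces the work fills this in;
# whoever verifies it checks against it.
from dataclasses import dataclass, field

@dataclass
class Handoff:
    task_id: str
    scope: str                                            # what this step owns, and nothing else
    inputs: list[str] = field(default_factory=list)       # files and docs the next step should read
    acceptance: list[str] = field(default_factory=list)   # how the next step verifies this one
    out_of_scope: list[str] = field(default_factory=list) # explicitly not touched
```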

What teams get wrong

The common failure is treating agent adoption as a tooling purchase.

"We bought the coding agent. Why is the output inconsistent?"

Because the repo is not legible.

"We added a long instruction file. Why does it still miss constraints?"

Because the instructions are not connected to executable checks.

"We asked it to test. Why did it say everything works when the product is broken?"

Because the harness did not force user-level verification.

"We added more agents. Why did quality get worse?"

Because coordination is now the problem.

"We ran evals. Why did production still fail?"

Because the eval looked only at final answers and not at traces, tool choices, state transitions, or real user paths.

The model is usually not the only bottleneck. The system around the model is.

The hiring signal

Harness engineering is becoming a serious hiring signal because it is not a narrow AI trick.

It sits at the intersection of:

  • software architecture
  • developer experience
  • product engineering
  • observability
  • test design
  • security boundaries
  • eval design
  • context engineering
  • human judgment

This is why AI-native engineering should not mean "person who writes prompts." The useful version is closer to staff-level product/platform engineering: someone who can design the loop where humans, models, tools, codebases, and checks compound instead of colliding.

That person can still write code. They just do not treat code as the only lever.

The point

The prompt is a message.

The harness is the working system around the message.

If the harness is weak, the agent may still produce impressive fragments. It may even ship useful changes. But the quality will depend on luck, operator vigilance, and how much hidden context the human keeps re-injecting by hand.

If the harness is strong, the agent can see more of reality. It can recover from failure. It can verify its work. It can leave useful state behind. It can operate inside boundaries. It can improve the environment that improves future runs.

That is the difference between using a coding agent and building with one.

Further reading:

  • Harness engineering: leveraging Codex in an agent-first world
  • Effective harnesses for long-running agents
  • Harness design for long-running application development
  • Building effective agents
  • A practical guide to building agents
  • Agent evals
  • Scaling Managed Agents