AI Engineering

Harness engineering is the missing discipline around coding agents

OpenAI and Anthropic are converging on the same lesson: coding agents do not become reliable because the prompt is clever. They become reliable when the repo, tools, evals, traces, browser checks, and review loops make their work legible.

Apr 30, 2026 · 13 min read

The prompt is not the product.

That sentence is useful, but it can also hide too much. So let us define the thing properly.

A prompt is the instruction packet a model receives for a run: the task, the surrounding context, constraints, examples, system instructions, and conversation state available at that moment. In a coding agent, the prompt may also include tool descriptions, repository instructions, previous messages, and the current plan.
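As a rough sketch only, that packet could be modeled as a small structure; the field names below are illustrative, not any particular vendor's API:

```python
from dataclasses import dataclass, field

@dataclass
class PromptPacket:
    """Hypothetical shape of what one agent run actually receives."""
    system_instructions: str                        # stable rules and persona
    task: str                                       # what this specific run should accomplish
    repo_instructions: str = ""                     # e.g. the contents of AGENTS.md
    tool_descriptions: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    conversation_state: list[dict] = field(default_factory=list)
    current_plan: str = ""                          # plan carried over from earlier steps
```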

Prompts matter. Bad prompts produce bad work. But a prompt is still only an input to a loop.

A harness is the loop.

It is the runtime, repository structure, tool interface, sandbox, documentation system, test suite, browser access, observability, evaluation criteria, review process, and human escalation policy around the model. It is the engineered environment that lets an agent make progress without guessing what reality looks like.
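As a minimal sketch of that loop, where every name is a stand-in for whatever runtime, tool layer, and checks a team actually uses:

```python
def run_agent(task, model, tools, checks, max_steps=50):
    """Sketch of a harness loop: the model proposes, the harness executes and verifies.
    `model`, `tools`, and `checks` are placeholders, not a real framework API."""
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        action = model.next_action(state)            # the prompt is assembled here, from state
        if action.kind == "finish":
            break
        result = tools[action.name](**action.args)   # executed inside the sandbox
        state["history"].append((action, result))    # observations feed the next step
    evidence = [check(state) for check in checks]    # "done" is decided by checks, not by the model
    return state, evidence
```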

That is why harness engineering is becoming more important than prompt engineering for serious coding-agent work.

Why the term matters now

OpenAI's article on harness engineering with Codex is not interesting because it says Codex can write code. Everyone has seen models write code.

It is interesting because OpenAI describes an internal product where Codex wrote not only product code, but also tests, CI, docs, observability, and internal tools. The human role shifted upward: engineers designed environments, specified intent, and built feedback loops so agents could execute reliably.

That is the real claim.

Not "the prompt got better."

The operating system around the agent got better.

OpenAI describes making the application itself legible to Codex: local worktrees that can boot independently, browser control through Chrome DevTools, DOM snapshots, screenshots, navigation, logs, metrics, and traces exposed in local development. Once the agent can see the running product and the telemetry, tasks like startup latency or slow spans become tractable instead of aspirational.

They also describe a repository knowledge system where AGENTS.md is not a giant manual. It is a table of contents. The deeper truth lives in structured docs, architecture notes, design docs, product specs, execution plans, generated schema docs, quality documents, and references that are versioned with the repo.

That detail matters. A large instruction file feels like control, but it usually becomes stale context. A harness uses progressive disclosure: start the agent with a small stable map, then let it discover the relevant source of truth.

Anthropic's work points in the same direction. In Effective harnesses for long-running agents, the central problem is not that Claude lacks intelligence. The problem is continuity. Long tasks span multiple context windows. Each new session needs a way to know what happened before, what remains unfinished, how to run the app, and what counts as done.

Their solution is concrete: an initializer agent creates a feature list, the coding agent works one feature at a time, progress is committed to git, notes are written for the next session, and end-to-end browser testing is made explicit. The model is not asked to "be careful." The harness changes the shape of the work so careful behavior is easier.

In Harness design for long-running application development, Anthropic goes further. They use planner, generator, and evaluator roles, sprint contracts, visual criteria, and browser-driven evaluation to turn subjective work like frontend quality into something the system can actually grade. The important part is not the multi-agent architecture by itself. The important part is that taste, usability, originality, and verification are encoded into a reviewable process.

This is the pattern across serious agent work:

  • The model is not enough.
  • The prompt is not enough.
  • The environment must become legible.
  • The definition of done must become testable.
  • The handoff between humans, tools, and agents must become explicit.

That is harness engineering.

Prompt engineering versus harness engineering

Prompt engineering asks:

  • What should I tell the model?
  • What examples should I provide?
  • What tone, format, and constraints should be in the instruction?
  • How do I reduce ambiguity in this run?

Harness engineering asks:

  • What should the agent be able to see?
  • What tools should it be able to use?
  • What state survives between runs?
  • What evidence proves the work is correct?
  • What boundaries should be mechanically enforced?
  • What must remain a human decision?
  • What did this failure teach the system for next time?

The prompt is one layer of the harness. It is not the harness.

If an agent cannot find the right file, run the right tests, start the app, inspect the browser, query logs, understand the architecture, or recover from a broken intermediate state, a better prompt only gives you a more articulate failure.

The anatomy of a useful coding-agent harness

A coding-agent harness has several layers. You can build them incrementally, but skipping them is why so many agent demos feel impressive and then fall apart inside real repositories.

1. A repository map, not a sacred manual

Agents need orientation. So do humans.

The weak version is a huge instruction file full of rules, preferences, and outdated warnings. It feels thorough, but it eats context and decays quickly.

The stronger version is a compact entry point:

  • where the app entry points live
  • where product specs live
  • where architecture rules live
  • how to run checks
  • what directories are dangerous
  • where examples of good changes live
  • where known failure modes are documented

The repo then holds the deeper knowledge as versioned artifacts: architecture docs, execution plans, schemas, design rules, eval definitions, quality notes, and migration history.

This is context engineering at repository scale. The goal is not to put everything into context. The goal is to make the right thing discoverable at the right time.
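One cheap way to keep that entry point honest is a check that every path it references still exists, so the map cannot silently rot. A sketch, assuming a markdown-style AGENTS.md that links into docs/, specs/, and plans/:

```python
# Sketch: keep the compact entry point honest by verifying that every path it
# references still exists. The paths and the AGENTS.md link format are assumptions.
import re
import sys
from pathlib import Path

REPO = Path(__file__).resolve().parent
ENTRY_POINT = REPO / "AGENTS.md"

def referenced_paths(text: str) -> list[str]:
    # Assume the map links deeper docs with markdown-style paths like (docs/architecture.md)
    return re.findall(r"\(((?:docs|specs|plans)/[\w./-]+)\)", text)

def main() -> int:
    missing = [p for p in referenced_paths(ENTRY_POINT.read_text()) if not (REPO / p).exists()]
    for p in missing:
        print(f"AGENTS.md points at a file that no longer exists: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
```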

2. Tools designed for agents, not just humans

Anthropic's Building effective agents makes a point that deserves more attention: tool definitions deserve as much care as prompts.

A command that is obvious to an experienced engineer can be awkward for a model. A relative filepath can become a source of failure after the agent changes directories. A JSON-escaped code field can make a simple edit more error-prone. Two tools with overlapping names can cause tool-selection failures.

This is agent-computer interface design.

For coding agents, that means:

  • commands have clear names and predictable output
  • test scripts are documented and fast enough to run repeatedly
  • file-editing tools avoid formats that are hard for the model to produce
  • local dev startup has one obvious command
  • browser checks are first-class for UI work
  • logs and traces are queryable without leaving the task
  • dangerous tools are separated from reversible read-only tools

The tool layer should reduce cognitive load for the model in the same way good internal tooling reduces cognitive load for engineers.
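A sketch of what that looks like for one tool, using real pytest and subprocess calls but an invented function name and return shape: absolute paths, bounded output, and an explicit error instead of a stack trace.

```python
import subprocess
from pathlib import Path

def run_tests(repo_root: str, test_path: str, timeout_s: int = 120) -> dict:
    """Agent-facing test runner sketch: small, predictable output the model can read."""
    target = (Path(repo_root) / test_path).resolve()   # never depend on the agent's cwd
    if not target.is_file():
        return {"ok": False, "error": f"no such test file: {target}"}
    proc = subprocess.run(
        ["pytest", str(target), "-q", "--maxfail=5"],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "output": proc.stdout[-4000:],                 # bounded tail, not an unbounded dump
    }
```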

3. A sandbox that can be trusted

An agent that can edit code but cannot safely run the app is half blind.

The harness needs an execution environment where the agent can do real work without touching production by accident. That usually means isolated worktrees, local services, throwaway databases or fixtures, seeded test data, environment variables with safe defaults, and permission boundaries around destructive actions.

OpenAI's Codex example is useful here because they made each change bootable and inspectable in isolation. Anthropic's later work on Managed Agents separates the system into three stable concepts: session, harness, and sandbox. That framing is clean. The session is the append-only record. The harness is the loop routing model and tool interaction. The sandbox is where the agent can execute.

Those are different concerns. Coupling them too tightly makes long-running work hard to debug, recover, and scale.
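A sketch of the sandbox piece alone: one isolated git worktree per task, with throwaway defaults for anything that could touch shared state. The branch naming, environment variables, and layout are assumptions about a hypothetical repo.

```python
# Sketch: one isolated worktree per task, with safe throwaway defaults.
import os
import subprocess
import tempfile

def create_sandbox(repo_root: str, task_id: str) -> dict:
    workdir = tempfile.mkdtemp(prefix=f"agent-{task_id}-")
    subprocess.run(
        ["git", "-C", repo_root, "worktree", "add", "-b", f"agent/{task_id}", workdir],
        check=True,
    )
    env = {
        **os.environ,
        "DATABASE_URL": "sqlite:///:memory:",   # never a shared or production database
        "APP_ENV": "sandbox",
        "EXTERNAL_CALLS": "disabled",           # destructive side effects stay off by default
    }
    return {"workdir": workdir, "env": env}
```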

4. State that survives the context window

Long tasks do not fail only because the model is wrong. They fail because the next run does not know what the previous run learned.

Anthropic's long-running harness uses feature lists, progress files, git history, and startup checks so each new session can recover orientation. This is boring in exactly the right way.

The minimum viable version is:

  • a task list with explicit pass/fail status
  • a progress file that records what changed and what remains
  • commits with useful messages
  • a startup script that proves the app still boots
  • a rule that the next agent starts by reading the state before editing

This is not paperwork. It is memory.

If the only durable artifact is the final diff, the harness is forcing future agents to rediscover the reasoning.
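The durable-state layer can be embarrassingly small. A sketch, where the file name and fields are assumptions rather than any standard:

```python
# Sketch: durable state the next session reads before touching code.
import json
from datetime import datetime, timezone
from pathlib import Path

PROGRESS = Path("PROGRESS.json")

def record_progress(done: list[str], remaining: list[str], notes: str) -> None:
    PROGRESS.write_text(json.dumps({
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "done": done,                  # features with passing checks, not "probably fine"
        "remaining": remaining,
        "notes": notes,                # what the next session must know before editing
    }, indent=2))

def load_progress() -> dict:
    # The rule: read this before editing anything.
    return json.loads(PROGRESS.read_text()) if PROGRESS.exists() else {"done": [], "remaining": []}
```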

5. Checks that make "done" hard to fake

Agents are prone to premature completion. Humans are too, but agents can do it faster.

The harness should make completion expensive to fake and cheap to verify.

For a backend change, that might mean unit tests, integration tests, contract tests, database migration checks, and trace assertions.

For a frontend change, it must include the running product: browser snapshots, responsive viewports, console errors, network failures, screenshots, and user-level flows. A component can type-check and still be visually broken. A page can render and still hide the primary action below the fold.
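A browser-level check along those lines could be a short Playwright script; the URL, selector, and artifact paths below are placeholders:

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

def check_page(url: str, out_dir: str = "artifacts") -> dict:
    """Sketch: screenshots at two viewports, console errors, and one user-level assertion."""
    Path(out_dir).mkdir(exist_ok=True)
    console_errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("console", lambda msg: console_errors.append(msg.text) if msg.type == "error" else None)
        page.goto(url)
        page.set_viewport_size({"width": 375, "height": 812})       # phone-sized viewport
        page.screenshot(path=f"{out_dir}/mobile.png", full_page=True)
        page.set_viewport_size({"width": 1440, "height": 900})
        page.screenshot(path=f"{out_dir}/desktop.png", full_page=True)
        # "primary-action" is a hypothetical test id for the page's main call to action
        primary_action_visible = page.locator("[data-testid='primary-action']").is_visible()
        browser.close()
    return {"console_errors": console_errors, "primary_action_visible": primary_action_visible}
```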

OpenAI's agent evals documentation and trace grading work are relevant because agent quality often lives in the path, not only the final answer. Did the agent choose the right tool? Did it escalate at the right moment? Did it preserve policy constraints? Did it silently skip a required step?

Output checks are necessary. Trace checks catch a different class of failures.
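A trace check might look like this sketch; the trace format, a list of step dictionaries, is an assumption, and the point is only that it inspects the path taken rather than the final diff:

```python
# Sketch: trace-level assertions over the path the agent took.
def required_steps_happened(trace: list[dict]) -> list[str]:
    failures = []
    tools_used = [step["tool"] for step in trace if step.get("type") == "tool_call"]
    if "run_tests" not in tools_used:
        failures.append("agent never ran the test suite")
    if any(step.get("touched_ui") for step in trace) and "browser_check" not in tools_used:
        failures.append("UI changed but no browser-level verification ran")
    if trace and trace[-1].get("risk") == "high" and trace[-1].get("type") != "human_review":
        failures.append("high-risk final step was not escalated for review")
    return failures
```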

6. Mechanical boundaries for architecture and taste

Documentation alone will not keep an agent-generated codebase coherent.

OpenAI's article is blunt on this point: architecture and taste need enforcement. They describe custom linters, structural tests, file-size limits, schema naming conventions, structured logging, dependency direction rules, and remediation instructions embedded in lint errors.

This is the part many teams miss.

If a team says "we prefer clean architecture" but the repo allows any file to import any other file, the harness is lying. If a team says "we care about accessibility" but no check catches missing labels, the harness is relying on mood. If a team says "we avoid generic UI" but the evaluator has no criteria for originality, the agent will drift toward the average.

Taste can be partially encoded. Not perfectly, but enough to improve the baseline:

  • naming rules
  • import boundaries
  • design tokens
  • contrast checks
  • screenshot review
  • layout invariants
  • examples of accepted and rejected work
  • evaluator rubrics with calibrated examples

Human judgment still matters. The point is to spend human judgment on the hard calls, not on the same preventable defects every day.
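One concrete boundary, sketched with Python's standard ast module and layer names that are assumptions about a hypothetical codebase: a dependency-direction rule whose failure message carries the remediation.

```python
# Sketch: a dependency-direction rule as a failing check, with the fix in the message.
import ast
import sys
from pathlib import Path

FORBIDDEN = {"domain": ["infrastructure", "api"]}   # domain must not import outward

def violations(root: str = "src") -> list[str]:
    found = []
    for path in Path(root).rglob("*.py"):
        layer = path.relative_to(root).parts[0]
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                names = [a.name for a in node.names] if isinstance(node, ast.Import) else [node.module or ""]
                for name in names:
                    if any(name.startswith(bad) for bad in FORBIDDEN.get(layer, [])):
                        found.append(
                            f"{path}: '{layer}' imports '{name}'. "
                            "Fix: depend on an interface in domain/ and implement it in infrastructure/."
                        )
    return found

if __name__ == "__main__":
    problems = violations()
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```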

7. Review loops with escalation

OpenAI's agent guide describes agents as systems that use tools within guardrails and can hand control back to a user when needed. That escalation ability is not optional in production systems.

A coding harness should define what the agent can do alone, what requires review, and what must stop for a human.

Examples:

  • changing authentication boundaries
  • loosening permissions
  • deleting production data
  • changing pricing or billing logic
  • sending external messages
  • accepting a large dependency
  • rewriting architecture
  • altering legal, security, or compliance text

The right harness does not make the agent timid everywhere. It makes the agent aggressive in reversible work and conservative at trust boundaries.
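That boundary can be encoded as data the harness enforces rather than a plea in the prompt. A sketch, with tiers and action names that are purely illustrative:

```python
# Sketch: escalation policy as data the harness can enforce.
AUTONOMOUS = {"edit_file", "run_tests", "run_linter", "take_screenshot"}
NEEDS_REVIEW = {"add_dependency", "change_schema", "rewrite_module"}
HUMAN_ONLY = {"change_auth_boundary", "loosen_permissions", "delete_production_data",
              "change_billing_logic", "send_external_message"}

def gate(action: str) -> str:
    if action in HUMAN_ONLY:
        return "stop_and_escalate"       # hand control back to a person
    if action in NEEDS_REVIEW:
        return "queue_for_review"        # proceed, but a reviewer must approve the diff
    if action in AUTONOMOUS:
        return "proceed"                 # reversible work: be aggressive
    return "stop_and_escalate"           # unknown actions default to the conservative path
```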

A practical harness for a real repo

If I were making a codebase agent-ready, I would not start with a complex swarm. I would start with the smallest harness that makes one strong agent reliable.

| Layer | Artifact | What it prevents |
| --- | --- | --- |
| Orientation | Short AGENTS.md plus linked docs | The agent starts in the wrong place |
| State | Task list, progress notes, git history | Each session rediscovers the work |
| Execution | dev, test, build, check, seed scripts | The agent guesses how to run the app |
| Product vision | UX, copy, and design criteria | Technically valid but generic output |
| Verification | Unit, integration, browser, screenshot, trace checks | The agent declares victory too early |
| Boundaries | Linters, schemas, permissions, guardrails | Speed creates architectural drift |
| Review | Specialist reviews and human escalation rules | Risky changes merge by momentum |
| Memory | Failure notes and recurring cleanup | The same mistake repeats forever |

The order matters. If a single agent cannot reliably complete a well-scoped task with this harness, adding more agents usually multiplies coordination problems.

Multi-agent systems can be valuable. OpenAI describes manager and handoff patterns. Anthropic's application harness uses planner, generator, and evaluator roles. But the win comes from better contracts, not from more actors.

If agents split work without shared context, shared state, and explicit contracts, they create integration debt. If they share everything, they can overload context. Harness engineering is the discipline of deciding what each part sees, what each part owns, and how the next step verifies the previous one.
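An explicit contract does not need to be elaborate. A sketch of a handoff payload that one step fills in and the next step verifies against, with illustrative field names:

```python
# Sketch: an explicit contract between steps. Whoever produces the work fills this in;
# whoever verifies it checks against it.
from dataclasses import dataclass, field

@dataclass
class Handoff:
    task_id: str
    scope: str                                            # what this step owns, and nothing else
    inputs: list[str] = field(default_factory=list)       # files and docs the next step should read
    acceptance: list[str] = field(default_factory=list)   # how the next step verifies this one
    out_of_scope: list[str] = field(default_factory=list) # explicitly not touched
```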

What teams get wrong

The common failure is treating agent adoption as a tooling purchase.

"We bought the coding agent. Why is the output inconsistent?"

Because the repo is not legible.

"We added a long instruction file. Why does it still miss constraints?"

Because the instructions are not connected to executable checks.

"We asked it to test. Why did it say everything works when the product is broken?"

Because the harness did not force user-level verification.

"We added more agents. Why did quality get worse?"

Because coordination is now the problem.

"We ran evals. Why did production still fail?"

Because the eval looked only at final answers and not at traces, tool choices, state transitions, or real user paths.

The model is usually not the only bottleneck. The system around the model is.

The hiring signal

Harness engineering is becoming a serious hiring signal because it is not a narrow AI trick.

It sits at the intersection of:

  • software architecture
  • developer experience
  • product engineering
  • observability
  • test design
  • security boundaries
  • eval design
  • context engineering
  • human judgment

This is why AI-native engineering should not mean "person who writes prompts." The useful version is closer to staff-level product/platform engineering: someone who can design the loop where humans, models, tools, codebases, and checks compound instead of colliding.

That person can still write code. They just do not treat code as the only lever.

The point

The prompt is a message.

The harness is the working system around the message.

If the harness is weak, the agent may still produce impressive fragments. It may even ship useful changes. But the quality will depend on luck, operator vigilance, and how much hidden context the human keeps re-injecting by hand.

If the harness is strong, the agent can see more of reality. It can recover from failure. It can verify its work. It can leave useful state behind. It can operate inside boundaries. It can improve the environment that improves future runs.

That is the difference between using a coding agent and building with one.

Further reading:

  • Harness engineering: leveraging Codex in an agent-first world
  • Effective harnesses for long-running agents
  • Harness design for long-running application development
  • Building effective agents
  • A practical guide to building agents
  • Agent evals
  • Scaling Managed Agents