The Operational Meta-Harness

Databricks Omnigent made the meta-harness visible. The durable principle is narrower: the system that can fail should not be its own judge.

Jul 8, 2026 (updated)

Start with the principle, because the term is negotiable:

Do not let the system that can fail be its own judge.

Not of what it is allowed to do. Not of what it claims it did. Not of what it remembers. Not of which harness changes deserve acceptance.

The architecture I keep needing in agentic software engineering is not the model, and it is not even the agent. It is the harness. And once agents ship with their own harnesses, the next abstraction is the harness above the harness: an operational meta-harness.

The phrase needs a careful definition, because it can easily sound like one more layer of agent hype. I mean something narrower:

An operational meta-harness is the layer above existing agent harnesses that governs the work path from human intent to external evidence: routing, context, execution boundaries, policy decisions, trace capture, verification, acceptance, rollback, and controlled workflow evolution.

A harness for harnesses, in other words, or in plainer terms the layer that turns powerful agent sessions into a governable engineering system. Codex CLI already has a harness, and so does Claude Code. Cursor, Gemini CLI, local coding agents, MCP-based agents, hosted agents, and future tools ship their own opinionated ones. So the question is no longer only how to prompt the model better, or even how to give the agent better context. It is this: once the agent can act, who governs the conditions under which it acts? That is the layer I want to name.

Scope note: this is a working operational definition, not a claim of invention. The term already has another use in optimization research, and Databricks has made the product category visible with Omnigent. The claim here is about the boundary I need for real engineering work.

That boundary has an important split inside it. The whole meta-harness is not a reference monitor. Most of it is orchestration, evidence capture, review workflow, and compatibility plumbing. Useful, but too large to treat as a trusted computing base. The trusted core has to stay small: deterministic policy at the action boundary, tamper-resistant audit, and operator-owned eval gates that the system being evaluated cannot rewrite. The larger architecture earns trust by routing work through that core and preserving evidence around it, not by inheriting the authority of the reference monitor as a label.

The model is not the production unit

Most public discussion still starts with the model. Which one is smarter, which writes better code, which has the bigger context window, which follows instructions, which can reason longer? Those questions matter, but they do not describe real agentic software engineering. A raw model does not operate a repository, choose a branch, isolate a worktree, or decide which files are safe to mutate. It does not maintain an audit trail, know when a human must approve a dangerous action, or define what "done" means for a task. A model predicts, an agent uses tools, and a harness is what makes that tool use operational.

A production system needs more than intelligence. It needs boundaries, contracts, state, permissions, evidence, recovery paths, and governance. That is why the harness, not the model, is the production unit. And once the harness itself becomes a component inside a larger workflow, the governance unit becomes the meta-harness.

From prompting to context to harnesses

Practical LLM engineering has moved through a sequence of increasingly externalized control layers. Prompt engineering asked how to phrase the task so the model gives a better answer, and it mattered when usage was mostly conversational or single-shot and the model was treated as an intelligent text generator whose main lever was instruction. But a better prompt cannot fix stale knowledge, inspect a repository, enforce permissions, run tests, or carry project memory across sessions. As soon as the work grew larger than the prompt, the question moved on.

Context engineering asked how to put the right knowledge in front of the model at the right time: documentation extraction, RAG, markdown knowledge bases, project docs, session summaries, style guides, API references, architecture decisions, memory files. The goal was not to prompt better but to build a working cognitive environment around the model. Then agents became capable enough to read files, edit code, run commands, call tools, inspect logs, and iterate, and the question changed again: what operating environment lets this agent do useful work without turning into chaos? That is harness engineering.

A harness is the structure around the agent that makes operation useful, bounded, observable, and repeatable. It can include tool and file access, sandboxing, command execution, context injection, memory, policies, approvals, worktrees, task state, logging, test execution, CI integration, retry loops, output contracts, and human review. Tooling is part of it, but the harness is the whole operating envelope: it defines the conditions under which the agent works. Coding agents now ship with their own. Codex has its session and execution model, sandboxing, approvals, managed configuration, hooks, telemetry, and code-editing behavior; Claude Code has its tool loop, hooks, permissions, skills, subagents, memory, and plugins; Cursor has its editor-integrated runtime; MCP servers expose external tools behind another protocol surface. So the question moves up a level: how do you govern multiple harnesses as components of one system? That is meta-harness engineering, and it is not about replacing Codex or Claude Code but about operating them.

It is also not an excuse to skip learning their native controls. Before building anything above a harness, configure the one you already have: sandbox mode, approval policy, hooks, managed configuration, skills, MCP servers, telemetry, memory, working-directory rules, whatever extension points the tool exposes. That work is not beneath the thesis. It is the base layer. The meta-harness is the higher-order layer that decides what enters the agent, what context it gets, what is allowed, what must be recorded, what must be reviewed, what counts as success, and when the system should stop.

This is no longer only theoretical. OpenAI describes Codex surfaces as powered by the same Codex harness, the agent loop underneath its web, CLI, IDE, and app experiences. LangChain puts it bluntly: if it is not the model, it is the harness. GitHub already uses "agent control plane" language for enterprise AI controls, sessions, audit logs, and MCP policies. The vocabulary is converging on one fact: the model is not the system.

Wrapper, orchestrator, control plane, harness, meta-harness

The term matters because the nearby words are close but not equivalent. A wrapper calls another tool: a script that runs codex exec "fix this issue" is convenient, but it does not define policy, state, evidence, verification, or governance. An orchestrator coordinates work, splitting tasks, dispatching jobs, calling agents, collecting outputs, and chaining steps, but it can do all of that badly, moving work around without knowing whether the work is safe, auditable, reproducible, or acceptable. A control plane governs resources and configuration, which is the right frame for infrastructure, permissions, queues, users, metrics, and policies; an agent control plane may be part of a meta-harness, but the phrase does not preserve the link to harness engineering, and it is already taken. GitHub's enterprise AI Controls are explicitly an agent control plane: centralized policy, session visibility, audit events, custom agents, MCP allowlists. That is one real part of the governance story, not the whole of it.

I keep "operational meta-harness" because the layer is not only administration over a fleet. It is also the executable path from human intent to context package, selected harness, worktree, policy decision, trace, verification artifact, acceptance decision, and cleanup. A harness surrounds an agent and makes it operational, providing tools, context, permissions, memory, execution, feedback, and limits; it is the local operating envelope. A meta-harness sits above harnesses and treats them as execution engines. It does not merely call them. It governs them. A wrapper calls, an orchestrator coordinates, a control plane configures, and a meta-harness governs.

This is different from optimization-oriented meta-harnesses

There is already another valid use of the term. In March 2026, the paper "Meta-Harness: End-to-End Optimization of Model Harnesses" used it for an outer-loop system that searches over harness code, framing the harness as the code that decides what to store, retrieve, and present to the model, then optimizing it against tasks and traces. That is a real meaning, just not the one I need. An optimization-oriented meta-harness treats the harness as the object to improve and asks how to find a better one automatically. The operational meta-harness treats the harness as the object to govern: it assumes useful harnesses already exist, shipped by tools like Codex, Claude Code, Cursor, MCP servers, or internal runtimes, and asks how to operate them safely, repeatably, observably, and with human accountability. Optimization meta-harnesses improve harnesses; operational ones govern them. The two can coexist, and a mature operational system may eventually optimize parts of itself, but the meanings should not be collapsed. This article argues for the operational one.

After Databricks Omnigent

This article was first published on June 7, 2026. On June 13, 2026, Databricks announced Omnigent, an open-source meta-harness for combining, controlling, and sharing agents across tools such as Claude Code, Codex, Pi, and custom agents. That makes the term more visible, and it also makes the distinction sharper.

Omnigent is a concrete product direction for the layer above agent harnesses: a shared interface, server, policies, cloud execution, collaboration, and session sharing around multiple agents. The operational definition here is the architectural boundary underneath that product category. It asks what must be governed when agent harnesses become execution engines: intake, routing, context, permissions, evidence, audit, memory, verification, rollback, and workflow evolution.

The useful distinction is scope. A product like Omnigent can implement parts of a meta-harness. An operational meta-harness is the control layer as a system property. It can include a hosted product, a local CLI, native Codex and Claude Code configuration, MCP gateways, policy engines, CI, trace capture, signed audit, and human review. The thing to avoid is reducing the term to a wrapper around agents. If the layer cannot say what is allowed, what evidence counts, who approved risk, what changed, and how to recover, it is orchestration without enough governance.

That boundary also separates the term from the now-common phrase agent control plane. A control plane governs agents as resources: identity, inventory, policies, sessions, MCP access, audit, and visibility. An operational meta-harness governs agent work as a process: task intake, context, harness selection, execution, evidence, acceptance, rollback, and workflow evolution. In a mature product those responsibilities may live together. Architecturally, they are not the same thing.

This is also different from tuning a native harness

There is a more practical distinction too. Most operators should not start by building a meta-harness; they should first use the harness in front of them properly. If you use Codex, configure Codex: learn its sandbox modes and approval policies, use managed requirements for admin-enforced constraints, hooks where the hook surface is enough, MCP allowlists where the native configuration supports them, telemetry where you need usage and tool-decision evidence. If you use Claude Code, configure Claude Code: permissions, hooks, settings, skills, subagents, MCP configuration, monitoring, and project instructions, before pretending the tool needs an external control layer.

This is not a small point, because optimizing a harness from the inside is a different job from governing harnesses from the outside. Inside-harness optimization asks:

How far can this agent's own configuration, hooks, permissions, memory, skills, MCP settings, and telemetry take me?
Which workflow failures are solved by using the native API correctly?
Which custom scripts should disappear because the host now supports the behavior directly?
Which policies belong in the host's managed configuration rather than in an external wrapper?

That work is valid and often the right answer. It is also humbling, because many things that look like architecture are just missing configuration. The operational meta-harness begins where the native boundary becomes visible:

cross-agent policy that must apply to Codex, Claude Code, Cursor, and MCP tools
evidence that must survive outside any one agent transcript
approvals that must be out-of-band from the agent's own conversation
policy tests that must run without launching the agent
routing across multiple harnesses
worktree, branch, sandbox, and CI conventions shared across tools
audit formats reviewed independently of the host vendor
deprecation rules for when native improvements make external scaffolding obsolete

The claim is not that existing harnesses are inadequate, or that every team needs another layer. It is that once harnesses become powerful execution engines, some organizations and serious solo operators may need an operational layer above them, one that uses native capabilities wherever it can and governs only what the native harness cannot or should not own alone.

Below that threshold, the meta-harness is overhead. If one operator uses one agent in one repository and the native harness already gives enough permissions, logs, tests, and review, adding a second-order layer mostly adds latency and maintenance. The argument starts paying for itself when there are several harnesses, shared policy, out-of-band approval, audit requirements, cross-tool evidence, or workflow evolution that has to be accepted without trusting the agent's own narrative.

It also creates real failure modes. A central governance layer can become a single point of failure. Every update to Codex, Claude Code, Cursor, MCP, or a local runner can break an adapter or invalidate a policy assumption. Speaking several harness dialects at once is not free. A good meta-harness has to price that compatibility debt explicitly, delete parts when native harnesses make them redundant, and keep the trusted core smaller than the surrounding workflow system.

Better models do not remove governance

A common objection: if models keep improving, does all this external workflow become obsolete? Partly yes, partly no, and the split is the point. Some layers around models exist because models are weak, and others exist because models are strong.

Capability scaffolding compensates for what the model cannot yet do reliably: injecting fresh documentation by hand, writing helper scripts because the agent cannot navigate well, keeping ad hoc context files because the model forgets constraints, over-explaining framework APIs because its knowledge is stale, guiding every edit because it cannot preserve structure. This layer should be deprecated aggressively as the model or native harness absorbs the capability. Anthropic makes the point directly: in "Harness design for long-running application development" the author removes pieces of the harness as newer models handle more natively, and in "Scaling Managed Agents" it goes further, noting that harnesses encode assumptions that go stale as models improve. A good meta-harness should not defend yesterday's scaffolding as sacred architecture.

Context scaffolding is a different thing: it carries project-specific knowledge, because no general model automatically knows the local truth of a repository, organization, convention, business rule, or historical tradeoff. A stronger agent uses that context better but still has to get it from somewhere. Execution scaffolding is different again: it defines the operating theater, the worktrees, branches, isolated environments, test commands, CI gates, deployment previews, rollback paths, task queues, and artifact capture. Models can drive those systems better over time, but someone still has to define them, and the stronger the agent, the more it matters that execution happens inside a controlled theater.

Governance scaffolding exists precisely because the model is capable: permissions, policy-as-code, human approvals, signed grants, audit logs, traceability, security boundaries, escalation paths, evidence retention, acceptance criteria, rollback authority, post-hoc review. A model that cannot do much needs little governance. A model that can edit, execute, inspect, call tools, mutate state, push branches, open PRs, touch infrastructure, and coordinate other tools needs a great deal of it. So:

The better the agent gets, the less you need capability scaffolding, but the more you need governance scaffolding.

That is the answer to the obsolescence objection. Many workflow components should die. Governance is not one of them, and it grows more important as models get stronger.

A meta-harness should make workflow evolution governable

A second objection: every new model or agent changes the workflow, so shouldn't everything be rethought constantly? Yes, and that is an argument for a better meta-harness, not against one. A serious operational layer should not freeze a workflow; it should make workflow evolution governable. Every new model release, runtime, hook API, MCP capability, sandbox mode, or tool surface should trigger a reassessment of what can be deprecated, what is now native, what still needs an external policy layer, what should move into the host tool, what must stay outside the agent for safety or auditability, what evidence proves the new flow is equivalent or safer, which old assumptions are now false, and what new risk the new capability introduces.

That is what separates a meta-harness from a pile of scripts: knowing which pieces are temporary compensations, which are local context, which are execution structure, and which are governance invariants. Without that discipline, agent tooling becomes a museum of old model limitations, where old prompts, context hacks, scripts, warnings, and workarounds all linger until the system is heavy, superstitious, and hard to reason about. But deleting everything is just as dangerous, because some of it is not a hack. A stronger model can retire an old context hack. It does not retire audit, permissions, rollback, or human accountability.
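One way to keep that discipline honest is to tag every workflow component with the kind of scaffolding it is and, for compensations, the limitation it exists to cover. The sketch below is only an illustration of the four-way split used in this essay; the names `Kind`, `Component`, `compensates_for`, and `deprecation_candidates` are hypothetical, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CAPABILITY = "capability"   # temporary compensation for a model limit
    CONTEXT = "context"         # local truth about the repo or organization
    EXECUTION = "execution"     # worktrees, CI gates, rollback paths
    GOVERNANCE = "governance"   # permissions, audit, approvals


@dataclass(frozen=True)
class Component:
    name: str
    kind: Kind
    compensates_for: str = ""   # the limitation this covers, if any


def deprecation_candidates(components, now_native):
    """Capability scaffolding whose stated limitation is now handled natively.

    Context, execution, and governance components are never returned:
    a stronger model does not retire them.
    """
    return [
        c.name
        for c in components
        if c.kind is Kind.CAPABILITY and c.compensates_for in now_native
    ]
```

Run against a registry after each model or harness release, a function like this only produces a review list. Whether a candidate is actually deleted remains an operator decision backed by evidence.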

What an operational meta-harness contains

An operational meta-harness is an architectural layer rather than a single binary or product. Parts of it may live in local CLIs, CI, policy engines, GitHub Actions, hooks, MCP gateways, review bots, dashboards, state stores, audit logs, or human approval flows. What matters is the role each part plays, not where its code runs. The subsystems are recognizable.

Task intake decides what work enters the system. A GitHub issue, local prompt, CI failure, alert, or operator command is not automatically safe to delegate, so intake asks what repo it touches, whether it is scoped, which agent fits, what context it needs, and what acceptance contract applies. The context compiler then builds the package the agent receives: relevant files, architecture docs, issue text, prior decisions, failing test output, policy constraints, recent diffs, known pitfalls. Its job is not to dump everything but to supply enough local truth without flooding the agent.
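As a rough illustration of the intake questions above, the following sketch refuses to delegate a task that has no target repository, no scope, or no acceptance contract. The field names (`allowed_paths`, `acceptance_checks`) and the function `intake_problems` are assumptions made for the example, not an existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRequest:
    source: str                              # e.g. "github_issue", "ci_failure"
    repo: str                                # repository the task touches
    description: str
    allowed_paths: tuple[str, ...] = ()      # scope the agent may mutate
    acceptance_checks: tuple[str, ...] = ()  # commands that define "done"


def intake_problems(task: TaskRequest) -> list[str]:
    """Return the reasons a task is not yet safe to delegate (empty if none)."""
    problems = []
    if not task.repo:
        problems.append("no target repository")
    if not task.allowed_paths:
        problems.append("unscoped: no allowed paths")
    if not task.acceptance_checks:
        problems.append("no acceptance contract")
    return problems
```

The design choice is that intake returns reasons rather than a bare boolean, so a rejected task leaves evidence of why it was rejected.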

The agent router chooses the execution engine. Codex may suit a repo edit, Claude Code an exploratory refactor, a local model a classification, a static analyzer a deterministic check, and a human an ambiguous architectural decision. Routing is not only about model quality; it weighs risk, cost, permissions, context, latency, and evidence. The execution theater then prepares the environment: branch, worktree, container, sandbox, temporary home, clean dependency install, limited token scope, restricted network, seed data, rollback path. The agent should work inside that theater instead of undefined space.

The policy gateway decides which actions are allowed, mapping observed tool calls to capabilities and evaluating policy, so it can deny a dangerous action, request approval, or record a signed decision. That is the deterministic part. The human approval flow handles the exceptions, and a mature system does not force every unusual action into a binary allow or deny: it supports bounded exceptions with exact scope, limited TTL, a use count, a reason, an approval record, revocation, and an audit trail. The verification layer checks the output with tests, lint, type checks, security scans, policy fixtures, snapshots, integration runs, benchmarks, and browser flows, because the agent's claim is not enough. The acceptance layer then decides whether the work is done: did the requested change happen, were forbidden changes avoided, did the diff stay in scope, did the checks pass, was risk introduced, is human review required. That acceptance decision can use deterministic evidence, but it is not the same thing as policy evaluation. It often contains judgment about product intent, risk, scope, and review.
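The bounded-exception idea can be made concrete with a small sketch. It assumes a hypothetical capability table (`POLICY`) and a hypothetical `Grant` record carrying the scope, TTL, use count, reason, and approver described above. It is not Gommage's API or any vendor's. One design assumption is stated in the code: a grant can only satisfy an "ask", never override a hard "deny".

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"


@dataclass
class Grant:
    """A bounded exception: exact scope, limited TTL, a use count, a reason."""
    capability: str       # exact scope, no wildcards in this sketch
    expires_at: float     # epoch seconds
    uses_left: int
    reason: str
    approved_by: str
    revoked: bool = False

    def usable(self, capability: str, now: float) -> bool:
        return (not self.revoked and self.capability == capability
                and now < self.expires_at and self.uses_left > 0)


# Hypothetical policy table: capability -> default decision.
POLICY = {
    "fs.read": Decision.ALLOW,
    "git.push.main": Decision.ASK,
    "secrets.read": Decision.DENY,
}


def decide(capability: str, grants: list[Grant], now: float) -> Decision:
    base = POLICY.get(capability, Decision.DENY)  # unknown capability: deny
    if base is not Decision.ASK:
        return base                               # grants never override DENY
    for grant in grants:
        if grant.usable(capability, now):
            grant.uses_left -= 1                  # consume one use
            return Decision.ALLOW
    return Decision.ASK                           # escalate to a human
```

The clock is passed in as `now` rather than read inside the function, so the same inputs always give the same decision and the gateway can be tested without launching an agent.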

The audit and replay layer records what happened and lets it be reconstructed later: logs, signed decisions, policy hashes, command output, diffs, artifacts, state snapshots, approval records, replay tools. And the evolution layer, the most underrated one, tracks when a part of the workflow should be deprecated or replaced, knowing which pieces exist because of current model limits and which are enduring governance boundaries.
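A common way to make audit records hard to rewrite quietly is a hash chain, where each entry commits to the one before it. The sketch below shows only that mechanism, using Python's standard `hashlib` and `json`. The entry layout and the function names are assumptions for illustration, and the signing mentioned above is deliberately left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list[dict], event: dict) -> None:
    """Append an event whose hash commits to the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

A chain like this is tamper-evident, not tamper-proof: anything that can rewrite the whole log can also recompute every hash. The latest hash therefore has to be signed or stored somewhere the agent cannot reach.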

This is also where an agent control plane and an operational meta-harness diverge. The control plane is where policy, visibility, session management, fleet configuration, and administration live. The meta-harness is the broader operating layer that turns human intent into governed agent work and then turns the resulting activity into evidence, acceptance, rollback, and workflow evolution. In a given product they may be the same system; architecturally they are not identical.

Why native agent permissions are necessary but not sufficient

If Codex and Claude Code already have permissions, why add another layer? Not because native permissions are useless. They are valuable and should stay enabled. Codex's "Agent approvals and security" docs treat sandbox mode and approval policy as separate layers, one for what the agent can technically do and one for when it must ask, alongside OS-level sandboxing, network policy, MCP and app approvals, automatic review, and opt-in telemetry. Claude Code's hooks are deterministic lifecycle commands that enforce rules, format code, block protected files, reinject context, and audit configuration. The native harnesses are getting stronger, which is good.

The operational meta-harness exists because they are not the whole operating system of the workflow. Native permissions are usually local to the agent runtime: hard to review outside the tool, dependent on transcript state, often in the wrong evidence format, rarely sharing a policy language across agents, and not built for organization-level review, signed audit, reproducible policy tests, or cross-agent governance. So a mature setup composes layers:

keep native sandboxing and approvals
add external policy where reproducibility and auditability matter
isolate risky execution at the OS or container level
keep human approval out of the agent transcript
preserve evidence independently of the agent's narrative

That is defense in depth. The agent's harness is one layer; the operational meta-harness governs the stack.

The watcher cannot be something the watched can edit

There is a classic name for what a strong control has to be: a reference monitor. The idea is old, described by Anderson in 1972. A reference monitor mediates every relevant action, cannot be tampered with by the thing it watches, and is small enough to actually verify. A native hook fails that test the moment the agent can edit the config that defines it, or route work through a path outside its invocation point.
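The tamper-resistance property can be checked mechanically, at least in its crudest form: does any file that defines a control sit inside a directory the agent is allowed to write? The sketch below is a purely lexical check with hypothetical paths and a hypothetical function name. It does not resolve symlinks or `..` segments, so it is a first screen and not proof of isolation.

```python
from pathlib import PurePosixPath


def reachable_controls(control_paths, writable_roots):
    """Return the control files that sit inside a directory the agent can write.

    Lexical comparison only: symlinks and ".." are not resolved.
    """
    roots = [PurePosixPath(root) for root in writable_roots]
    exposed = []
    for path in map(PurePosixPath, control_paths):
        if any(path == root or root in path.parents for root in roots):
            exposed.append(str(path))
    return exposed


# Hypothetical layout: one hook config inside the worktree, one policy outside.
controls = ["/work/repo/.agent/hooks.json", "/etc/agent-policy/policy.toml"]
agent_writable = ["/work/repo", "/tmp"]
print(reachable_controls(controls, agent_writable))
```

Anything this function returns is a control the watched process could edit, which makes it automation and not enforcement in the sense used here.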

I expanded that boundary in "Reference monitors for coding agents", because this is the security line that makes the rest of the meta-harness claim testable.

But the claim has to stay narrow. The meta-harness as a whole is not a reference monitor, because it is too large: intake, context compilation, routing, execution setup, dashboards, trace viewers, acceptance review, and workflow evolution are all useful, and all fallible. They should be tested, versioned, audited, and improved, but they should not be trusted as if they were the small enforcement mechanism. The reference-monitor bar belongs to the pieces that actually mediate authority: action-boundary policy, audit records the agent cannot quietly rewrite, and admission gates such as operator-owned eval suites.

This is the part operators underestimate. The hook does not fail because the model forgets it; on the paths it covers, it is deterministic and it runs. The agent simply drifts off those paths. In a long session, with context already compressed, the model can wander into a route the hook does not cover, or quietly touch the config that defines it, usually without intent. The hook is still there. The agent just is not on the path it guards anymore.

That is why this matters even for well-behaved agents. A malicious-agent threat model is the easy case: do not run malicious agents. The more useful case is ordinary drift from well-intentioned agents. The control has to live somewhere the agent cannot reach, because a watcher the watched can modify is not a watcher. It is also the line between automation and enforcement. A hook that lints, formats, or runs tests inside the harness is fine, because convenience does not have to be tamper-proof. A control you lean on for safety has to satisfy the reference-monitor properties, and an in-harness hook cannot, once the agent is strong enough to touch its own configuration.

Gommage as one layer of the meta-harness

Gommage should not be framed as the whole meta-harness; that would overclaim. The precise claim is narrower:

Gommage is one concrete layer inside an operational meta-harness: deterministic policy, approval, and audit for AI coding agent tool calls.

It sits between an agent and the operation the agent wants to perform, maps observed tool calls to capabilities, evaluates declarative policy, and can allow, deny, or ask, with signed and bounded break-glass grants and signed audit evidence. It is deliberately not a sandbox, and that boundary matters: a hook is not a kernel, a policy engine is not syscall mediation, and a signed audit log is not process isolation. Gommage does not replace Codex's native controls, Claude Code's native controls, or OS-level confinement; it composes with them. OS confinement controls what the process can do at the system level, native permissions control what the agent runtime exposes or asks about, Gommage controls deterministic policy and audit at the tool-call boundary it can observe, human approval handles exceptions out-of-band, CI and tests validate the resulting code, trace capture records what actually happened, and the meta-harness governs how those layers fit together. That makes Gommage a concrete public example of the thesis, a product-shaped fragment of the stack rather than the whole of it.

slod as the evidence layer

The same thesis appears from another angle in slod, the trace tool formerly called Traceframe. Gommage answers what an agent is allowed to do; slod answers what it actually did, and that distinction matters because the agent's explanation is not the source of truth. A transcript, a shell log, or a final answer is not enough. An operational system should be able to reconstruct:

what task was assigned
what context was provided
which agent or harness ran
what tools were invoked
what was allowed or denied
what required approval
what files changed
what tests ran
what failed and what succeeded
what was accepted
what was rolled back

This is reconstruction, not only prevention. Without it, each agent session is an anecdote; with it, sessions become operational events, and agent activity becomes reviewable evidence rather than a story you have to take on faith.
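The reconstruction list above maps naturally onto a record type plus a check for what cannot yet be answered. The sketch below is a hypothetical shape, not slod's actual schema; the field names and the choice of which fields count as required are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionTrace:
    task: str = ""                                               # what was assigned
    context_refs: list[str] = field(default_factory=list)        # what context was provided
    harness: str = ""                                            # which agent or harness ran
    tool_calls: list[dict] = field(default_factory=list)         # what tools were invoked
    policy_decisions: list[dict] = field(default_factory=list)   # allowed or denied
    approvals: list[dict] = field(default_factory=list)          # what required approval
    files_changed: list[str] = field(default_factory=list)
    test_runs: list[dict] = field(default_factory=list)          # what failed, what passed
    accepted: Optional[bool] = None                              # None means undecided
    rolled_back: Optional[bool] = None


REQUIRED = ("task", "harness", "tool_calls", "policy_decisions", "test_runs")


def missing_evidence(trace: SessionTrace) -> list[str]:
    """List the questions this trace cannot answer yet."""
    gaps = [name for name in REQUIRED if not getattr(trace, name)]
    if trace.accepted is None:
        gaps.append("accepted")
    return gaps
```

A session whose trace still reports gaps is the anecdote case. The point of the check is to make that state visible instead of letting a final answer stand in for the record.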

Nahuali as the memory layer

Gommage governs what an agent may do and slod records what it did, but there is a third question that is easy to miss: what does the agent remember, and how much of it should the caller trust? An agent that runs across sessions accumulates a store, and most memory layers keep that store flat, so a thing the user said two minutes ago and a thing the model inferred a month ago sit side by side with equal authority.

The usual upgrade is a confidence score that drops when memory contradicts itself. That beats flat, but it hides the same failure as self-policing rules: if the model that wrote the memory is also the thing that scores it, the signal is circular, and a model that hallucinates with confidence will score its hallucination high. Self-scoring memory is self-policing in another costume, the system that can drift grading its own recollection. So confidence should not be a bare number the model hands you. It should be auditable over evidence: where the fact came from, how old it is, what supports it, what contradicts it, and an explicit rule for what happens when two memories collide, because lowering both scores does not tell you which one loses. Making the score deterministic does not fully fix it either, since the judgment just moves to the schema, the contradiction detector, or the resolution policy. But that move is the point, because it pushes the judgment out of the model's own narrative and into a place you can inspect.
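To show what an explicit collision rule can look like, here is a minimal sketch in which two contradicting memories are resolved by a fixed, inspectable ordering instead of a model-assigned score. The ordering itself (source authority, then amount of evidence, then recency) is an assumption chosen for illustration and is not nahuali's behavior. It is exactly the kind of judgment that moves into the resolution policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Source(IntEnum):
    # Higher value = more authority in this hypothetical ordering.
    MODEL_INFERENCE = 1
    TOOL_OBSERVATION = 2
    USER_STATEMENT = 3


@dataclass(frozen=True)
class Memory:
    claim: str
    source: Source
    recorded_at: float               # epoch seconds
    evidence: tuple[str, ...] = ()   # references that support the claim


def resolve(a: Memory, b: Memory) -> tuple[Memory, str]:
    """Return (winner, rule applied) for two contradicting memories."""
    if a.source != b.source:
        winner = a if a.source > b.source else b
        return winner, "higher-authority source wins"
    if len(a.evidence) != len(b.evidence):
        winner = a if len(a.evidence) > len(b.evidence) else b
        return winner, "more supporting evidence wins"
    winner = a if a.recorded_at >= b.recorded_at else b
    return winner, "more recent record wins"
```

Returning the rule alongside the winner is the part that matters: the caller sees why one memory lost, and an operator who disagrees can change the rule, not argue with a number.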

That is what Nahuali is for. It treats memory as something you audit, not something you assume, surfacing evidence and health behind a recall so a caller can see why a piece of memory should or should not be trusted, with an optional tamper-evident ledger underneath so the recorded past cannot be rewritten quietly. The memory layer should show evidence before trust, not hand you a number and ask for faith. That completes a pattern: Gommage governs rules, slod records evidence, and Nahuali audits memory. The principle underneath all three is the same. Do not let the system that can fail be its own judge, not of what it is allowed to do, not of what it claims it did, and not of what it remembers.

Greco: optimization inside an operational frame

Greco is my experiment in whether a coding-agent harness can measurably improve itself. It is where I am testing the coexistence claim from the optimization section. The model is frozen. The unit of evolution is the harness modification: a typed, layered, reversible change to the control plane around the model. Today that means cached procedures and subagent prompts. Settings, hooks, and the deeper modification layers are still roadmap.

The relevant part for this essay is the containment mechanism. Session traces expose friction. The agent proposes a modification. The modification is checked against an operator-owned evaluation suite that the system being graded cannot edit. If the measured baseline-vs-candidate delta clears the deterministic gate, the modification can become active. The operator does not approve each proposal one by one. The operator defines the experiment, owns the eval suite, sets the budgets, and audits aggregate behavior.

The modification record is append-only in practice: proposed, validated, active, rejected, and retired artifacts remain available for audit. The loop has strict budgets, a freeze switch, one-command rollback, and an acceptance gate that can only make admission stricter. It cannot loosen its own grading criteria to make a bad change pass.
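The gate described above can be sketched in a few lines. This is an illustration under stated assumptions, not Greco's code: `AdmissionGate`, `min_delta`, and the score values are hypothetical names and numbers. It shows the two properties that matter: admission depends only on the measured baseline-vs-candidate delta, and the threshold can be raised but never lowered.

```python
from dataclasses import dataclass


@dataclass
class AdmissionGate:
    # Minimum (candidate - baseline) improvement required for admission.
    min_delta: float

    def tighten(self, new_min_delta: float) -> None:
        """Accept only a stricter threshold; a looser one is refused."""
        if new_min_delta < self.min_delta:
            raise ValueError("admission gate can only be made stricter")
        self.min_delta = new_min_delta

    def admit(self, baseline_score: float, candidate_score: float) -> bool:
        """Deterministic decision over measured scores, nothing else."""
        return (candidate_score - baseline_score) >= self.min_delta


gate = AdmissionGate(min_delta=0.02)

# With an always-pass eval suite both runs score the same, so nothing is admitted.
always_pass = gate.admit(baseline_score=1.0, candidate_score=1.0)

# A measured improvement above the threshold clears the gate.
improved = gate.admit(baseline_score=0.70, candidate_score=0.80)
```

In this sketch the scores would come from the operator-owned suite; the proposer never gets a handle to `tighten` with a smaller number, and even if it did, the call would raise.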

That read-only eval suite is the reference-monitor line in another form. A harness that can rewrite the tests used to admit its own changes is grading its own homework. Greco avoids that specific failure by keeping the evaluation suite outside the system under test.

That does not make the entire Greco loop trusted. The small trusted part is the operator-owned eval and admission gate. The proposer, cached procedures, prompts, reports, and modification records are ordinary harness machinery around it: useful, inspectable, and fallible.

Greco is still embryonic. It is a single-operator alpha, not a product. The governance loop is built and exercised, but the central measurement is only half-wired: with the current always-pass eval suite, the measured improvement is zero, so the autonomous loop applies nothing. That is the correct result. The experiment is useful only if it can be declared false when the evidence does not hold.

Determinism belongs to policy, not every judgment

At the model layer, non-determinism is unavoidable: the model may produce different plans, different edits, different explanations, different recoveries. Not every layer should inherit that. Policy evaluation should be boring. Given the same tool call and the same policy, the decision should be the same. That separates two questions: what the agent wanted to do, and whether current policy allowed the action. The desired action can be fuzzy. The policy decision should be reproducible.
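A minimal sketch of what "boring" means here, assuming a hypothetical rule format (`Rule`, `decide`, and the example patterns are illustrative, not any tool's real policy syntax). The decision is a pure function of the tool call and the policy: no clock, no randomness, no model in the loop.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    tool: str
    pattern: str   # glob matched against the call's argument string
    decision: str  # "allow", "deny", or "ask"


def decide(policy: Tuple[Rule, ...], tool: str, argument: str) -> str:
    """First matching rule wins; anything unmatched is denied by default."""
    for rule in policy:
        if rule.tool == tool and fnmatchcase(argument, rule.pattern):
            return rule.decision
    return "deny"


# Order matters: the more specific force-push rule comes first.
POLICY = (
    Rule("shell", "git push --force*", "deny"),
    Rule("shell", "git push*", "ask"),
    Rule("shell", "git status*", "allow"),
)

force_push = decide(POLICY, "shell", "git push --force origin main")  # "deny"
plain_push = decide(POLICY, "shell", "git push origin main")          # "ask"
unknown = decide(POLICY, "shell", "rm -rf build")                     # "deny"
```

Running `decide` twice with the same inputs cannot give two answers, which is what makes the decision replayable later from the recorded call and policy version.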

Acceptance is different. "Did this introduce risk?", "is this the change the operator asked for?", and "does this need product or security review?" are not pure policy questions. They may be assisted by models, reviewers, tests, and checklists. The governance mistake is not using judgment; it is hiding judgment inside the deterministic core and then pretending the whole layer is boring. A serious system keeps policy decisions reproducible, makes acceptance decisions attributable, and preserves the evidence that led to both.

That is also why signed audit matters, because the agent's post-hoc explanation is not evidence. A serious system should be able to answer:

what action was attempted
what capability it mapped to
what policy version was active
what decision was made
whether a grant was used, and who approved it
when it happened
whether the log was tampered with
how to replay or explain the decision later

That is not model intelligence. It is operational accountability.
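One way to make "whether the log was tampered with" answerable is a hash chain, sketched below. The entry fields mirror the questions above, but the field names and values are assumptions for illustration. Chaining alone only detects edits relative to the last hash; a real deployment would also sign or externally anchor that head hash, which this sketch leaves out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first record


def _digest(prev_hash: str, entry: dict) -> str:
    # Canonical JSON so the same entry always hashes the same way.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, entry: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)})


def verify(log: list) -> bool:
    """Recompute every link; any edited, removed, or reordered record breaks it."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


log = []
append(log, {
    "action": "git push origin main",
    "capability": "repo.write",
    "policy_version": "v7",
    "decision": "ask",
    "grant": {"approved_by": "operator"},
    "at": "2026-07-08T10:00:00Z",
})
append(log, {
    "action": "git status",
    "capability": "repo.read",
    "policy_version": "v7",
    "decision": "allow",
    "grant": None,
    "at": "2026-07-08T10:00:05Z",
})
intact = verify(log)
```

Rewriting the first entry's `decision` after the fact changes its digest, so `verify` fails from that record onward.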

Human approval should be out-of-band

Approvals should not live inside the agent's own conversational channel. If the transcript that produced the risky action is also where its approval is negotiated, the boundary is weak: the agent can frame the request, pressure the user, omit context, and make the action sound routine. The approval path should be separate. Out-of-band means the human sees the request through a different channel: a TUI, a dashboard, a local command, a webhook, a signed request, a review queue, or a CI gate. The point is separation, not ceremony: the agent may request, the policy layer may escalate, and the human approves through a channel the agent does not control. That is a governance boundary.
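A small sketch of what "a channel the agent does not control" can mean in practice, assuming a shared-secret design: the approval channel holds a key the agent never sees and issues a token bound to one exact action. The function names, the key, and the bound fields are hypothetical. An asymmetric signature would be a stronger choice, since the verifier would not need the signing secret; HMAC is used here only to keep the example within the standard library.

```python
import hashlib
import hmac
import json


def _message(action: str, target: str, policy_version: str) -> bytes:
    # Bind the approval to the exact request, not to a description of it.
    return json.dumps([action, target, policy_version]).encode("utf-8")


def sign_approval(approver_key: bytes, action: str, target: str, policy_version: str) -> str:
    """Runs in the approval channel (TUI, dashboard, CI gate), not in the agent."""
    return hmac.new(approver_key, _message(action, target, policy_version), hashlib.sha256).hexdigest()


def approval_is_valid(approver_key: bytes, action: str, target: str,
                      policy_version: str, token: str) -> bool:
    """Runs in the policy layer before the action executes."""
    expected = sign_approval(approver_key, action, target, policy_version)
    return hmac.compare_digest(expected, token)


APPROVER_KEY = b"placeholder-key-held-outside-the-agent"

token = sign_approval(APPROVER_KEY, "git push", "origin/main", "policy-v7")

ok = approval_is_valid(APPROVER_KEY, "git push", "origin/main", "policy-v7", token)
reused_elsewhere = approval_is_valid(APPROVER_KEY, "git push", "origin/release", "policy-v7", token)
talked_into_it = approval_is_valid(APPROVER_KEY, "git push", "origin/main", "policy-v7", "the user said yes")
```

The agent can still ask, and it can still write a persuasive request, but nothing it types in the transcript produces a valid token.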

The agent's narrative is not the source of truth

A recurring failure in agentic engineering is treating the agent's explanation as evidence. The agent says it did the work, the tests passed, the change is safe, the file was untouched, the policy was followed. But the narrative is not the source of truth. Operational evidence is:

git diff
test output
command logs
audit entries
policy decisions
signed grants
CI status
trace files
file hashes
deployment records
human approvals
reproducible checks

The meta-harness exists partly to move trust from narrative to evidence, and this matters more as agents get more fluent, because the better the agent explains itself, the easier it is to mistake a convincing explanation for a verified state. The question is not whether the agent sounded right. It is what evidence exists outside its prose.
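As one small instance of checking a claim against evidence, consider "the file was untouched". A sketch, with hypothetical helper names and throwaway files: hash the files before the session, hash them again afterwards, and compare, regardless of what the agent reports.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    # A missing file gets a sentinel so deletion also counts as a change.
    if not path.exists():
        return "missing"
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths) -> dict:
    return {str(p): file_hash(Path(p)) for p in paths}


def changed_since(before: dict) -> list:
    return sorted(p for p, digest in before.items() if file_hash(Path(p)) != digest)


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config.toml"
    readme = Path(tmp) / "README.md"
    config.write_text("retries = 3\n")
    readme.write_text("hello\n")

    before = snapshot([config, readme])

    # Stand-in for an agent session that edits a file it later says it left alone.
    config.write_text("retries = 30\n")

    changed = changed_since(before)
    claim_holds = str(config) not in changed  # the claim: "config.toml was untouched"
```

The same shape applies to the other items in the list: the claim is an input to verification, and the recorded artifact decides.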

MCP and tool surfaces make the problem larger, not smaller

MCP matters because it standardizes how models and agents connect to external tools, data, and services, and in doing so it enlarges the governance problem. The authorization specification gives transport-level authorization for HTTP-based MCP servers through OAuth flows and resource binding, which helps with server access and token handling. The tools specification makes the trust boundary explicit: tools are model-controlled, and clients are expected to expose tool calls to users, confirm sensitive operations, validate results, and log usage. But transport authorization is not operational acceptance. An MCP client may be authorized to call a server without every call being appropriate for the current task, repo, branch, approval state, or risk boundary. The meta-harness has to reason about tool calls as work events, not just protocol messages. Native layers matter; they are not the whole governance layer.
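The gap between "authorized to call the server" and "appropriate for this task" can be sketched as a second check that runs after transport authorization succeeds. Everything here is an assumption for illustration: `WorkContext`, `ToolCall`, the capability strings, and the single rule are not part of MCP or of any client.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class WorkContext:
    repo: str
    branch: str
    protected_branches: FrozenSet[str]
    approved: FrozenSet[str]  # capabilities a human approved for this task


@dataclass(frozen=True)
class ToolCall:
    server: str
    tool: str
    capability: str             # hypothetical mapping, e.g. "repo.write"
    transport_authorized: bool  # outcome of protocol-level authorization


def acceptable(call: ToolCall, ctx: WorkContext) -> Tuple[bool, str]:
    """Judge the call as a work event, not just as a permitted message."""
    if not call.transport_authorized:
        return False, "not authorized at the transport layer"
    writes_protected = call.capability == "repo.write" and ctx.branch in ctx.protected_branches
    if writes_protected and call.capability not in ctx.approved:
        return False, "write to a protected branch without approval"
    return True, "ok"


on_main = WorkContext("acme/app", "main", frozenset({"main"}), frozenset())
push = ToolCall("git-server", "push", "repo.write", transport_authorized=True)

verdict = acceptable(push, on_main)  # authorized, but not acceptable here
```

The same `push` call becomes acceptable on a feature branch or once `repo.write` appears in `approved`, which is the sense in which acceptance depends on the work, not on the token.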

The public claim I can defend

The claim I can defend publicly is not that I invented the meta-harness, which is both unnecessary and easy to attack. It is this:

Do not let the system that can fail become its own judge. An operational meta-harness is one architecture for applying that rule to agentic software engineering: policy, evidence, memory, acceptance, and evolution live outside the agent that performs the work.

That sidesteps a fight over terminology. What matters is not whether the word is new but whether the architectural shape is real, and it is. OpenAI's Harness engineering with Codex describes a team shifting human work toward environments, intent, and feedback loops while Codex writes the code, tests, CI, docs, observability, and internal tooling. Anthropic's long-running agent work points the same way: agents need structured artifacts, state handoff, feature tracking, evaluation, and harness iteration across context windows. Codex and Claude Code are not raw models; they are agent harnesses. Gommage, slod, and Nahuali are not smarter agents; they are policy, evidence, and memory layers around agent operation. That is the shape.

Agentic software engineering is moving from isolated agent sessions toward governed systems of agent harnesses. The existing tools already provide powerful internal harnesses; at the current frontier, the next architectural layer is an operational meta-harness, a second-order control layer that governs which harness runs, with what context, under which permissions, with what evidence, and against which acceptance criteria. Better models will obsolete some capability scaffolding and increase the need for governance scaffolding. If more governance later migrates into native harnesses, the reference-monitor question still remains: can the watched thing edit the guard that claims to watch it? The meta-harness does not make the agent smarter. It makes the agent system governable.