
Reference monitors for coding agents

Coding-agent policy belongs at the tool boundary: every relevant action mediated, outside agent control, and small enough to test.


If a coding agent can run shell commands, edit files, call MCP tools, open browsers, request tokens, start local servers, and push branches, the security boundary is no longer the chat box.

The boundary is the operation.

This is why the old reference monitor idea keeps coming back when I think about coding agents. The term is not new agent vocabulary. It comes from operating-system security. In the 1972 Computer Security Technology Planning Study, James P. Anderson describes a reference monitor as the concept that validates references by programs to programs, data, or devices against authorized types of access. NIST's current glossary keeps the same shape: a reference validation mechanism is always invoked, protected from tampering, and small enough to analyze and test.

That maps too cleanly to coding agents to ignore.

A coding-agent reference monitor is the enforcement point between the agent and the operation it wants to perform. It does not ask whether the agent sounds careful. It asks what action is about to happen, under which identity, against which object, with which policy, in which task context, and whether that action is allowed.

Prompts are still useful. They are not the enforcement layer.

The old requirement still fits

The classical requirements are blunt:

  • Every relevant action must be mediated.
  • The monitor and its policy must not be editable by the watched process.
  • The mechanism must be small enough to test and reason about.

Anderson's report spoke in terms of users, programs, files, devices, and access matrices. Saltzer and Schroeder's "The Protection of Information in Computer Systems" gives the broader protection vocabulary: principals, protected objects, capabilities, permissions, revocation, confinement, and protected subsystems. The nouns changed. The shape did not.

For coding agents, the subject might be:

  • the agent session
  • the human account the agent acts for
  • the harness or IDE surface that launched it
  • the MCP client
  • the local tool server
  • the GitHub token, cloud token, database URL, or SSH identity available to the process

The object might be:

  • a file path
  • a repository
  • a branch
  • a production database
  • a package manager install target
  • a browser origin
  • an MCP resource
  • a secret-bearing environment variable
  • a GitHub issue, pull request, release, or repository setting

The reference is the action:

  • file.write to a path
  • bash with a parsed command vector
  • git push origin main
  • curl to an external host
  • mcp.call against a tool name and resource
  • vercel --prod
  • supabase db push
  • gh repo edit
  • opening a browser URL that can mutate account state

That action needs mediation before it happens. After-the-fact summaries are useful for review, but they are not access control.
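As a rough sketch of that vocabulary, the subject, object, and reference can travel together as one structured request. The type and field names here (`Subject`, `Target`, `Reference`, `Effect`) are hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    READ = "read"
    WRITE = "write"
    DELETE = "delete"
    PUBLISH = "publish"


@dataclass(frozen=True)
class Subject:
    session_id: str        # the agent session
    acting_for: str        # the human account the agent acts for
    credential_class: str  # e.g. "github-token" or "none"


@dataclass(frozen=True)
class Target:
    kind: str  # e.g. "file", "branch", "origin"
    name: str  # e.g. "src/app.py", "origin/main"


@dataclass(frozen=True)
class Reference:
    subject: Subject
    target: Target
    action: str    # e.g. "file.write", "git.push"
    effect: Effect


def describe(ref: Reference) -> str:
    """Render the authorization question the monitor has to answer."""
    return (
        f"may {ref.subject.session_id} (for {ref.subject.acting_for}) "
        f"{ref.action} [{ref.effect.value}] on {ref.target.kind}:{ref.target.name}?"
    )


ref = Reference(
    subject=Subject("session-1", "alice", "github-token"),
    target=Target("branch", "origin/main"),
    action="git.push",
    effect=Effect.PUBLISH,
)
print(describe(ref))
```

The frozen dataclasses are a deliberate choice in this sketch: once the host has built the request, nothing downstream can quietly change what is being asked.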

Prompt policy fails the reference monitor test

A system prompt can tell an agent not to delete files, not to push to production, not to expose secrets, and not to run unknown installers. That matters. It changes model behavior.

It still fails as a reference monitor.

First, it is not complete mediation. The policy text is considered when the model plans or generates a tool call, but the actual operation may happen through a different path: a shell wrapper, an MCP server, a browser session, a package script, a CI job, a GitHub Action, or a generated helper script. If the dangerous operation is not intercepted at execution time, the prompt did not mediate it.

Second, it is not tamper resistant in the right sense. The agent can edit repository docs, test scripts, generated wrappers, local policy files inside the workspace, and sometimes the very command that will be used to perform the operation. A policy that lives in the same editable layer as the work is guidance, not enforcement.

Third, it is too large and semantic to verify. "Be careful with production" is not a testable authorization rule. "Do not mutate protected branches unless the current command has an explicit signed approval for scope git.push.protected" is testable. The difference matters.
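To show what "testable" can mean here, this is a minimal sketch of a signed, expiring approval bound to one scope and one exact command. It assumes an HMAC key held by the host or approver and never stored in the workspace; `APPROVAL_KEY`, `sign_approval`, and `approval_is_valid` are hypothetical names, and a real system might prefer asymmetric signatures.

```python
import hashlib
import hmac
import json

# Assumption: this key is held by the host or approver, outside the workspace.
APPROVAL_KEY = b"demo-key-held-outside-the-workspace"


def _digest(payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(APPROVAL_KEY, body, hashlib.sha256).hexdigest()


def sign_approval(scope: str, command: list, expires_at: float) -> dict:
    """Issue an approval bound to one scope, one exact command, and an expiry."""
    payload = {"scope": scope, "command": command, "expires_at": expires_at}
    return {**payload, "sig": _digest(payload)}


def approval_is_valid(approval: dict, scope: str, command: list, now: float) -> bool:
    """Check signature, scope, exact command, and expiry. No free text involved."""
    payload = {k: approval.get(k) for k in ("scope", "command", "expires_at")}
    if not hmac.compare_digest(_digest(payload), str(approval.get("sig", ""))):
        return False
    if payload["scope"] != scope or payload["command"] != command:
        return False
    expires_at = payload["expires_at"]
    return isinstance(expires_at, (int, float)) and now < expires_at


push = ["git", "push", "origin", "main"]
token = sign_approval("git.push.protected", push, expires_at=1000.0)
print(approval_is_valid(token, "git.push.protected", push, now=999.0))
```

Every branch of that check can be exercised with a unit test, which is exactly what "be careful with production" does not allow.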

The model can still refuse. It can still explain risk. It can still ask for confirmation. Those are behavioral controls. The reference monitor is the thing that runs even when the model is confused, rushed, overconfident, or tricked by context.

Raw allowlists are not enough

Many agent systems start with an allowlist:

  • allow git
  • deny rm
  • allow curl
  • deny sudo
  • allow mcp.read
  • ask for mcp.write

That is better than nothing. It is also too crude.

git status and git push origin main are both git. curl https://docs.example.com and curl https://example.com/install.sh | sh are both curl. rm ./tmp/output.txt and rm -rf /Users/name are both rm. A browser click on a documentation page and a browser click on "delete project" are both browser actions.

A useful monitor has to classify the operation, not only the binary or tool.
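A toy classifier makes the difference concrete. The capability names it returns are invented for this sketch, and the parsing is deliberately naive: it treats any shell metacharacter as a compound command and assumes the git subcommand is the first argument.

```python
import shlex

SHELL_METACHARS = set("|;&`$><")
GIT_READ_ONLY = {"status", "log", "diff", "show"}


def classify(command_line: str) -> str:
    """Map a command line to a coarse, hypothetical capability name."""
    # A pipeline or redirect is a different operation than its first binary,
    # so it is classified before any per-tool logic runs.
    if SHELL_METACHARS & set(command_line):
        return "shell.compound"
    argv = shlex.split(command_line)
    if not argv:
        return "empty"
    tool, args = argv[0], argv[1:]
    if tool == "git":
        # Toy parsing: assumes the subcommand is the first argument.
        sub = args[0] if args else ""
        if sub in GIT_READ_ONLY:
            return "git.read"
        return "git.push" if sub == "push" else "git.other"
    if tool == "curl":
        return "net.fetch"
    if tool == "rm":
        short_flags = "".join(
            a[1:] for a in args if a.startswith("-") and not a.startswith("--")
        )
        recursive = "r" in short_flags.lower() or "--recursive" in args
        return "file.delete.recursive" if recursive else "file.delete"
    return "unknown"


for line in ("git status", "git push origin main", "rm -rf /Users/name"):
    print(line, "->", classify(line))
```

Same binary, different capability. A real classifier would also need the normalized target, which this sketch ignores.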

For coding agents, classification needs at least:

  • tool name and subcommand
  • normalized paths and repository root
  • network destination and protocol
  • working tree state
  • branch and remote
  • credential class available to the process
  • target environment, such as local, preview, staging, or production
  • task intent, when it is narrow enough to bind safely
  • approval state and expiration
  • whether the operation reads, writes, deletes, publishes, deploys, or changes access

The policy result should be deterministic. Natural language can describe why a rule exists, but the enforcement path should reduce to a concrete decision: allow, deny, require approval, or require a safer form of the command.
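One way to keep the result deterministic is a plain lookup with a default of deny. This is a sketch under my own assumptions: the `RULES` table, the capability names, and the environment names are hypothetical, and a real policy would have more dimensions than two.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    REQUIRE_SAFER_FORM = "require_safer_form"


# (capability, environment) -> decision. Anything not listed is denied.
RULES = {
    ("git.read", "local"): Decision.ALLOW,
    ("git.push", "preview"): Decision.ALLOW,
    ("git.push", "production"): Decision.REQUIRE_APPROVAL,
    ("shell.compound", "local"): Decision.REQUIRE_SAFER_FORM,
}


def decide(capability: str, environment: str, approved: bool = False) -> Decision:
    """Same inputs always give the same decision. No model call in this path."""
    decision = RULES.get((capability, environment), Decision.DENY)
    if decision is Decision.REQUIRE_APPROVAL and approved:
        return Decision.ALLOW
    return decision


print(decide("git.push", "production"))
print(decide("git.push", "production", approved=True))
print(decide("db.migrate", "production"))
```

The `approved` flag stands in for a verified approval check. It only upgrades a rule that already asks for approval, so it can never rescue an action the table denies outright.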

This is the line I care about in Gommage: the agent's tool call is mapped to a capability before it executes, and policy is evaluated outside the agent's normal reasoning loop. The implementation can be incomplete. The architectural boundary is the point.

MCP makes the boundary bigger

MCP is where this becomes less theoretical.

The MCP security guidance now has to talk about confused deputy problems, token passthrough, SSRF, session hijacking, local server compromise, OAuth URL validation, stdio proxy risks, and scope minimization. Those are not prompt-style concerns. They are authority-boundary concerns.

A local MCP server can run with the same privileges as the client. A remote MCP server can sit between a client and third-party APIs. A session event can change which tools are offered. An overbroad token can turn a small task into lateral access. A tool description can hide behavior that the agent will happily call because it looks useful.

The reference monitor question is simple:

Was this operation mediated by something outside the untrusted path?

For a local MCP install, that means the client should show the exact command before executing it, require explicit consent for dangerous local process execution, and sandbox the server where possible. For remote authorization, it means per-client consent, precise scopes, token audience checks, redirect validation, and logs that preserve which client did what. For tool invocation, it means a policy decision on the concrete tool call, not only on the fact that "MCP is allowed."
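For the tool-invocation case, a per-call check might look like the following. It is a sketch only: it does not use any MCP SDK, and the `McpCall` fields, the `TOOL_POLICY` table, and the server, tool, and scope names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class McpCall:
    client_id: str
    server: str
    tool: str
    token_audience: str
    token_scopes: frozenset


# Hypothetical policy: the scope each (server, tool) pair requires.
TOOL_POLICY = {
    ("issues-server", "issues.read"): "issues:read",
    ("issues-server", "issues.close"): "issues:write",
}


def check_mcp_call(call: McpCall) -> str:
    """Decide on the concrete call, not on 'MCP is allowed'."""
    required = TOOL_POLICY.get((call.server, call.tool))
    if required is None:
        return "deny: tool not in policy"
    if call.token_audience != call.server:
        # A token minted for another audience must not be passed through.
        return "deny: token audience mismatch"
    if required not in call.token_scopes:
        return "deny: missing scope " + required
    return "allow"


read_only = frozenset({"issues:read"})
print(check_mcp_call(McpCall("ide", "issues-server", "issues.read", "issues-server", read_only)))
print(check_mcp_call(McpCall("ide", "issues-server", "issues.close", "issues-server", read_only)))
```

Keeping `client_id` on the call is what lets a log preserve which client did what, even though this sketch does not branch on it.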

MCP moves authority into more local and remote process boundaries. The monitor has to follow those boundaries instead of treating tool access as one global yes or no.

Harnesses make agents useful. Monitors keep authority bounded.

OpenAI's Codex harness engineering write-up describes the work of making applications, logs, metrics, traces, repository knowledge, tests, and review loops legible to agents. Anthropic's work on long-running agent harnesses and application-development harnesses points in the same direction: feature lists, progress files, git history, browser testing, sprint contracts, evaluator passes, and explicit evidence.

That is harness engineering. It makes agent work possible to steer and inspect.

A reference monitor is narrower. It does not decide whether the app is good. It decides whether a proposed action may execute.

Those layers should work together:

  • The task contract says what the agent is trying to do.
  • The harness gives the agent tools, context, tests, browser access, and evidence paths.
  • The monitor decides whether each action is allowed under policy.
  • The trace records what actually happened.
  • The reviewer or evaluator decides whether the evidence is enough.

Putting all of that in one prompt blurs responsibilities. The prompt can ask for discipline. The harness can create better work loops. The monitor has to enforce authority.

The useful design shape

A coding-agent reference monitor does not need to be huge. It needs a hard position in the execution path.

The minimum shape I trust looks like this:

  1. The agent proposes an operation through a structured tool interface.
  2. The host converts that operation into a normalized action envelope.
  3. A small policy engine maps the envelope to a capability.
  4. The policy engine checks task context, workspace context, target context, and approval context.
  5. The host executes only if the decision allows it.
  6. The decision and operation are logged outside the workspace.
  7. The agent cannot edit the monitor, the policy source loaded for the run, or the log after the fact.
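The steps above can be sketched in a few dozen lines. Everything here is an assumption made for illustration: the `Envelope` fields, the `to_capability` mapping, the policy shape, and the in-memory list standing in for a log that would really live outside the workspace.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Envelope:
    tool: str
    argv: tuple
    environment: str


def to_capability(env: Envelope) -> str:
    """Step 3: map the normalized envelope to a capability name."""
    if env.tool == "git" and env.argv[:1] == ("push",):
        return "git.push"
    if env.tool == "git" and env.argv[:1] == ("status",):
        return "git.read"
    return "unknown"


class Monitor:
    def __init__(self, policy: dict, log_sink: list):
        # Step 7 (in miniature): keep a read-only copy of the policy for the run.
        self._policy = MappingProxyType(dict(policy))
        self._log = log_sink

    def run(self, tool, argv, environment, execute):
        env = Envelope(tool, tuple(argv), environment)                 # step 2
        capability = to_capability(env)                                # step 3
        allowed = self._policy.get((capability, environment), False)   # step 4
        self._log.append((capability, environment, allowed))           # step 6
        if not allowed:
            return None
        return execute(env)                                            # step 5


policy = {("git.read", "local"): True, ("git.push", "production"): False}
decision_log = []  # stand-in; a real log would live outside the workspace
monitor = Monitor(policy, decision_log)
executed = []
monitor.run("git", ["status"], "local", lambda e: executed.append(e.argv))
monitor.run("git", ["push", "origin", "main"], "production", lambda e: executed.append(e.argv))
print(executed)
print(decision_log)
```

The decision is logged before execution, so a crash inside the operation still leaves a record that the capability was granted. In-process read-only mappings are not real tamper resistance; that needs a process or privilege boundary the agent cannot cross.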

The envelope matters. Without it, policy becomes string matching. String matching breaks quickly when shell quoting, package scripts, generated commands, symlinks, redirects, or nested tool calls enter the system.
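Path handling is a small example of why. A prefix check on the raw string accepts paths that escape the workspace, while a check on the resolved path does not. This sketch uses only the standard library; `is_inside` and `naive_is_inside` are hypothetical helper names.

```python
import os
from pathlib import Path


def naive_is_inside(candidate: str) -> bool:
    """String matching: looks reasonable, breaks on '..' and symlinks."""
    return candidate.startswith("tmp/")


def is_inside(workspace: str, candidate: str) -> bool:
    """Resolve '..' and symlinks first, then compare against the workspace root."""
    root = Path(os.path.realpath(workspace))
    # os.path.join discards the workspace prefix when candidate is absolute,
    # so absolute paths are resolved as-is and rejected below.
    target = Path(os.path.realpath(os.path.join(workspace, candidate)))
    return target == root or root in target.parents


escape = "tmp/../../etc/passwd"
print(naive_is_inside(escape))              # the string check accepts it
print(is_inside(os.getcwd(), escape))       # the resolved check does not
```

Resolution and use still happen at different moments, so a check like this narrows the gap rather than closing it. That race is one more reason the monitor and the executor should be the same host-side path.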

The log matters too. A final answer saying "I did not touch production" has low evidentiary value. A signed or append-only decision log showing that no production capability was granted has higher value. It still needs review, but at least it is evidence from the control layer rather than prose from the actor.
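A hash chain is one simple way to make a decision log tamper-evident. The sketch below is an assumption about how such a log could work, not a description of any existing tool; `append_entry` and `verify_chain` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64


def _entry_hash(prev: str, record: dict) -> str:
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev + body).encode()).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Each entry commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev, "hash": _entry_hash(prev, record)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, {"capability": "git.read", "decision": "allow"})
append_entry(log, {"capability": "deploy.production", "decision": "deny"})
print(verify_chain(log))
```

A chain alone only detects edits in the middle. Truncating the tail or rebuilding the whole chain goes unnoticed unless the latest hash is also stored somewhere the agent cannot write.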

Failure modes

The failures are predictable.

Incomplete mediation: one tool path goes through policy and another path bypasses it. The shell is monitored, but a package script mutates the same target. File writes are monitored, but browser clicks are not. MCP calls are monitored, but local server startup is not.

Editable policy: the agent can change the YAML, config, prompt fragment, wrapper script, or generated allowlist that the monitor trusts. This includes "helpful" changes where the agent loosens a rule to finish a task.
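A cheap guard against this failure is to pin the policy at the start of the run and refuse to continue if the source later differs. This is a minimal sketch with a hypothetical `PinnedPolicy` class; it detects drift but does not by itself stop the agent from writing to the file.

```python
import hashlib


class PinnedPolicy:
    """Hold the policy bytes loaded at run start, plus their digest."""

    def __init__(self, policy_bytes: bytes):
        self._bytes = bytes(policy_bytes)  # private copy taken at load time
        self.digest = hashlib.sha256(self._bytes).hexdigest()

    def matches(self, current_bytes: bytes) -> bool:
        """True only if the policy source is byte-identical to what was loaded."""
        return hashlib.sha256(current_bytes).hexdigest() == self.digest


loaded = PinnedPolicy(b"git.push.protected: require_approval\n")
print(loaded.matches(b"git.push.protected: require_approval\n"))
print(loaded.matches(b"git.push.protected: allow\n"))
```

The stronger version keeps the policy source outside any path the agent can write, so the digest check becomes a second line of defense instead of the only one.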

Semantic fog: the rule says "dangerous operations require approval" without defining dangerous. The model explains why an operation is safe, and the host accepts that explanation as policy.

Oversized trusted core: the monitor becomes a general orchestration engine, reads half the repo, calls LLMs for classification, loads plugins dynamically, and becomes too large to test. At that point it may still be useful software, but it no longer gives reference-monitor assurance.

Missing object model: the system has no stable vocabulary for files, branches, environments, secrets, sessions, tools, scopes, and targets. Without objects, there is no clean authorization question.

No audit boundary: the same agent that performs the operation can rewrite the evidence. That collapses enforcement and narration into one actor.

What I would measure

The tests should be boring:

  • Does every tool invocation pass through the monitor?
  • Can the agent reach the same effect through an unmonitored path?
  • Can the agent modify the loaded policy for its current run?
  • Can the agent modify or delete the decision log?
  • Do protected branch pushes fail without the right approval?
  • Do production deploys fail without the right approval?
  • Do destructive file operations fail outside an allowed temporary directory?
  • Do curl | sh, cloud metadata requests, and private network requests fail by default?
  • Do broad MCP scopes require elevation instead of being granted at startup?
  • Does a denied action leave enough evidence for a human to understand what was blocked?
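Several of these checks translate directly into unit tests. The sketch below runs them against a toy monitor; `toy_monitor` and its capability names are hypothetical stand-ins for whatever the real enforcement path exposes.

```python
import unittest

ALLOWED_BY_DEFAULT = {"git.read", "file.delete.tmp"}
NEEDS_APPROVAL = {"git.push.protected", "deploy.production"}


def toy_monitor(capability: str, approvals: frozenset = frozenset()) -> str:
    """Default deny: anything not explicitly known is refused."""
    if capability in ALLOWED_BY_DEFAULT:
        return "allow"
    if capability in NEEDS_APPROVAL and capability in approvals:
        return "allow"
    return "deny"


class MonitorTests(unittest.TestCase):
    def test_protected_push_fails_without_approval(self):
        self.assertEqual(toy_monitor("git.push.protected"), "deny")

    def test_protected_push_passes_with_matching_approval(self):
        approvals = frozenset({"git.push.protected"})
        self.assertEqual(toy_monitor("git.push.protected", approvals), "allow")

    def test_production_deploy_fails_with_wrong_approval(self):
        approvals = frozenset({"git.push.protected"})
        self.assertEqual(toy_monitor("deploy.production", approvals), "deny")

    def test_pipe_to_shell_fails_by_default(self):
        self.assertEqual(toy_monitor("shell.pipe_to_sh"), "deny")

    def test_delete_outside_tmp_fails(self):
        self.assertEqual(toy_monitor("file.delete.home"), "deny")


def run_tests() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MonitorTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
print(result.wasSuccessful())
```

The bypass questions, such as whether an unmonitored path reaches the same effect, need integration tests against the real host. A table like this cannot answer them.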

This is not a replacement for sandboxing, tests, code review, or human judgment. It is the piece that stops those other controls from depending on the agent's self-description.

The claim

Coding agents make the old reference monitor requirement more useful, not less.

The agent can reason. The agent can inspect code. The agent can write tests. The agent can drive browsers. The agent can call tools that touch real systems. The more capable the system gets, the weaker prompt-only safety becomes.

The control that matters most should sit where the authority changes hands.

For coding agents, that is the tool boundary.

Claims

  1. A coding-agent reference monitor belongs at the tool boundary, not inside the prompt.
  2. The old reference monitor requirements still apply: complete mediation, tamper resistance, and a small enough mechanism to test.
  3. An allowlist is weak when it ignores task intent, workspace state, target path, credentials, escalation history, and operation semantics.
  4. Prompt refusals and final answers are not evidence because the agent can be wrong about what it did.
  5. MCP and local tool servers expand the need for reference-monitor semantics because they move authority out of chat and into process execution.
