Yue Song

Agents Are Not Prompts

Mar 28, 2026

AutoHarness and Anthropic point to the same lesson: agent performance depends less on prompt cleverness than on the harness that constrains, evaluates, and carries work forward.

Most agent failures do not look like deep failures of intelligence. They look like system failures.

A model takes an illegal action. A coding agent loses the thread halfway through a task. A generator declares success without testing the thing a user would actually touch. In each case, the problem is less “the model could not think” than “the system let it act badly, forget state, or approve itself too early.”

Over the past few weeks, two pieces of work sharpened that view for me.

The first is the paper AutoHarness: improving LLM agents by automatically synthesizing a code harness, by Xinghua Lou and colleagues. It is a research result, but a very practical one: instead of fine-tuning the model, they improve agent performance by synthesizing code around the model that constrains and validates what it can do.

The second is Anthropic’s engineering write-up Effective harnesses for long-running agents. That piece is an engineering account of how they structured long-running coding-agent workflows so work could survive across many context windows.

These are different kinds of work. One is a paper, the other is an engineering field note. My takeaway from reading them together is:

The performance of an agent is often dominated less by the model in isolation than by the system wrapped around it.

That is the shift I care about here. If that framing is right, then the main unit of agent engineering is not the prompt. It is the harness.

The Wrong Mental Model

A lot of agent discourse still assumes a simple progression:

better model -> better agent
better prompt -> better output

That mental model produces exactly the kinds of systems people complain about:

  • fragile workflows
  • hallucinated actions
  • unreliable execution
  • premature “done” states

The more useful model looks more like this:

agent performance = model x harness

And by harness, I mean everything around the model:

  • constraints
  • execution environment
  • validation logic
  • evaluation system
  • memory structure
  • control flow

Once you see that, a lot of current agent behavior stops being mysterious.

Agents Usually Fail at the System Boundary

The most striking example in the AutoHarness paper is not about bad strategy. It is about invalid action.

In the paper’s abstract, the authors note that in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Their point is not that the model lacked chess intelligence in some broad sense. It is that a large share of failure came from producing actions that the environment simply would not accept. That is a systems problem before it is a reasoning problem.

Anthropic describes a similar pattern from a different angle. In their long-running coding setup, the failures were not mainly “the model had no idea what to do.” Instead, the system kept breaking down around execution rhythm and state management:

  • the agent tried to do too much at once
  • it ran out of context mid-implementation
  • the next session inherited half-finished work
  • later sessions sometimes looked around, saw partial progress, and declared the project complete

This is an important reframing. Agents often do not fail because the base model is incapable in the abstract. They fail because the surrounding system lets them take invalid actions, lose state, skip verification, or self-certify progress too early.

That is why I increasingly think the core insight is this:

Agents fail at the system level, not only at the model level.

What a Harness Actually Is

The cleanest way to think about it is:

LLM = reasoning engine
Harness = execution system

The harness defines:

  • what actions are allowed
  • how actions are validated
  • how failures are handled
  • how state is maintained
  • how progress is evaluated

Without that layer, the LLM is just a stochastic generator. It may still be useful, but it is not yet an engineered agent.

This is exactly what makes the AutoHarness paper interesting. The authors are not mainly trying to make the model “think harder.” They let the model synthesize a code harness through iterative refinement using feedback from the environment. In other words, the improvement happens by changing the system that mediates between model and world.

Anthropic arrives at the same place from practice. Their initializer agent writes durable scaffolding such as init.sh, a progress log, and a structured feature list. Subsequent sessions do not begin by improvising from a giant prompt. They begin by reading artifacts, choosing one feature, making incremental progress, and leaving the repo in a clean state for the next session.

That is harness design.

From Token Loops to Executable Policy

One of the most important ideas in AutoHarness is a move away from raw token-by-token acting and toward code.

The naive pattern is:

LLM -> decide next step -> execute -> repeat

The AutoHarness direction is closer to:

LLM -> generate program or harness
Program -> executes task under constraints

That is a major shift:

  • from token-level decisions
  • to program-level execution

The benefits are obvious once stated:

  • more deterministic behavior
  • fewer hallucinated actions
  • less need to call the model at every tiny decision point
  • logic that can be tested and debugged as code

The paper pushes this all the way to “entire policy in code” in some settings, eliminating the need to use the LLM at decision time for those tasks. I do not think every agent will or should go that far. But the direction matters. It suggests that a meaningful part of agent intelligence can be compiled out of the prompt loop and into executable policy.

Why does that matter so much? Because once policy lives in code, the system can validate, retry, and refine behavior in a way a prompt alone usually cannot.
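As a toy illustration of policy compiled into code, here is a hedged sketch. Everything in it is hypothetical: `policy_source` stands in for code the model synthesized once, offline, and the run loop then executes with zero LLM calls at decision time.

```python
# Hypothetical sketch: the model is asked once to emit a policy as code;
# afterwards the task runs under that program, not under a token loop.
policy_source = """
def policy(state):
    # Deterministic rule the model synthesized: reduce the larger pile.
    a, b = state
    return ("take_a", 1) if a >= b else ("take_b", 1)
"""

namespace = {}
exec(policy_source, namespace)   # "compile" the synthesized policy
policy = namespace["policy"]

# The run loop now calls code, not the model, at each decision point.
state = (3, 2)
trace = []
while state != (0, 0):
    action, n = policy(state)
    trace.append(action)
    a, b = state
    state = (a - n, b) if action == "take_a" else (a, b - n)
print(trace)
```

The point of the sketch is only the shape: once the policy is a program, every decision is deterministic, inspectable, and free to run without a model call.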

Another way to say this is that good agent systems do not treat the model as a free actor. They treat it as a constrained generator.

At a high level:

while True:
    action = llm()        # propose
    if valid(action):     # validate before execution
        break             # only a validated action ever reaches the world

The exact implementation can vary. The validation function might be handwritten, synthesized, environment-derived, or shaped by repeated feedback. But the principle is the same:

  • do not trust raw output
  • validate before execution
  • let constraints carry part of the intelligence

That is what the chess example makes so vivid. If 78% of losses come from illegal moves, then a system that blocks illegal moves can create a massive capability jump without changing the base model at all.

This is also why prompts alone are often a dead end. A prompt can ask for care. A harness can enforce legality.

And once the model can iteratively improve a code harness using environment feedback, the search process is no longer just “try a different phrasing in the prompt.” It becomes a search over executable structures:

  • rules
  • validators
  • policies
  • wrappers
  • programs

Prompts are fragile. Programs are composable. Prompts are hard to inspect after the fact. Programs can be tested, versioned, and compared.

That changes the engineering surface. Instead of spending all your effort on better wording, you can do something more like:

generate program
-> run it
-> observe failure
-> refine program
-> repeat

That is a form of self-improvement without retraining. It is test-time optimization driven by the environment rather than by gradients.
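The generate-run-observe-refine loop above can be sketched concretely. This is an illustrative stub, not AutoHarness itself: `generate` and `refine` stand in for LLM calls, and the environment check is a single hand-written assertion.

```python
def run_candidate(source):
    """Execute candidate code and return (ok, feedback) from the environment."""
    env = {}
    try:
        exec(source, env)
        assert env["double"](21) == 42   # environment-level check, not self-report
        return True, "passed"
    except Exception as exc:
        return False, repr(exc)

def generate():
    return "def double(x):\n    return x + x_"   # buggy first draft

def refine(source, feedback):
    # A real system would call the model with the feedback; here we patch.
    return source.replace("x + x_", "x + x")

source = generate()
for _ in range(3):
    ok, feedback = run_candidate(source)
    if ok:
        break
    source = refine(source, feedback)
print(ok)  # True
```

The loop improves the artifact using environment feedback alone: no gradients, no retraining, just a search over programs.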

Long-Running Agents Need Externalized State

Anthropic’s post makes a different but complementary point: even a strong model with context compaction is not enough for long-running work.

Their diagnosis is blunt. Compaction helps, but it is not sufficient. A coding model given only a high-level prompt will often try to one-shot the whole app, lose the thread when context runs out, and leave the next session to reconstruct what happened. Later, another session may misread visible progress as completion.

The fix is not “ask the model more nicely.” The fix is to move state out of the model’s transient context and into durable artifacts.

In Anthropic’s setup, that included:

  • an init.sh script so future sessions know how to boot the environment
  • a claude-progress.txt file so sessions can inherit recent work
  • a structured feature list so progress is represented explicitly rather than guessed
  • git commits so the system can revert bad states and recover good ones

This is one of the most important operational insights in the whole article. State does not belong only in chat history. It belongs in files the next session can read, inspect, and continue from.

In practice, that suggests a general pattern:

  • spec.md
  • progress.md
  • tasks.json
  • test_results.json

The exact filenames do not matter much. The principle does.

This is also why aggressive context reset can be healthier than endless context compression. Instead of asking one growing conversation to carry the whole system forever, a better pattern is often:

new session
-> load durable artifacts
-> continue from explicit state

That reduces drift, cuts accumulated noise, and makes the next step depend on files and tests rather than on a model’s fading memory of earlier turns.
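A sketch of that bootstrap, assuming the illustrative filenames above (`progress.md`, `tasks.json`); none of this is a fixed convention, only the shape matters: a fresh session reconstructs explicit state from files rather than from chat history.

```python
import json
import tempfile
from pathlib import Path

def bootstrap(workdir):
    """Reconstruct explicit state for a fresh session from files on disk."""
    workdir = Path(workdir)
    progress_file = workdir / "progress.md"
    tasks_file = workdir / "tasks.json"
    progress = progress_file.read_text() if progress_file.exists() else ""
    tasks = json.loads(tasks_file.read_text()) if tasks_file.exists() else []
    # The next unit of work comes from explicit state, not model memory.
    pending = [t for t in tasks if not t.get("done")]
    return {"progress": progress, "next_task": pending[0] if pending else None}

# Toy usage with a temporary workspace.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "tasks.json").write_text(json.dumps(
        [{"id": 1, "done": True}, {"id": 2, "done": False}]))
    state = bootstrap(d)
    print(state["next_task"])
```

Because the state lives in files, it is also inspectable by humans and diffable in version control, which the chat transcript is not.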

Evaluation Has to Touch Reality

Anthropic also makes a point that should be obvious, but often is not: agents are bad at self-approval.

In their coding setup, absent explicit prompting, Claude would often make code changes and maybe run unit tests or curl, but still miss that the feature did not actually work end-to-end. Performance improved substantially once the system explicitly used browser automation tools and tested “as a human user would.”

That matters far beyond web apps.

Evaluation is not:

  • reading code and deciding it “looks right”
  • reasoning abstractly about likely behavior
  • letting the same generator declare success

Evaluation has to touch reality:

  • browser automation
  • API calls
  • CLI execution
  • end-to-end flows
  • environment feedback

If a generator can build, evaluate, and approve its own work inside one loop, it will often rationalize problems away. A stronger pattern is:

generate -> evaluate -> reject or accept -> improve -> repeat

Anthropic’s article does not literally present a formal planner -> generator -> evaluator diagram, but I think it supports that engineering pattern. One component turns a vague request into a structured feature list and setup artifacts. Another component implements one feature at a time. A separate testing layer checks behavior in the real environment. That separation matters because the generator should not be the final judge of its own success.

More broadly, this is a good rule for agent design: separate thinking from acting whenever the environment allows it. Let one component propose. Let another critique or evaluate. Let the system enforce constraints. Do not ask a single undifferentiated loop to think, act, approve, and remember everything.
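The generate-evaluate-accept loop can be sketched as follows. This is illustrative: the draft functions stand in for generator output, and the evaluator's job is to actually run the artifact rather than trust the generator's claim of success.

```python
def evaluate(candidate):
    """Independent check: execute the artifact; never ask its author."""
    try:
        return candidate(2) == 4 and candidate(-3) == 9
    except Exception:
        return False

# Stand-in generator output: first draft is wrong, second is correct.
drafts = iter([lambda x: x * 2,      # fails on -3 (returns -6, not 9)
               lambda x: x * x])     # correct square function
accepted = None
while accepted is None:
    candidate = next(drafts)         # generate
    if evaluate(candidate):          # evaluate: reject or accept
        accepted = candidate
print(accepted(5))  # 25
```

The separation is the point: the component that writes the work never gets to mark it done.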

Incrementality Is a Harness Decision

One of Anthropic’s best practical lessons is that decomposition is not just a planning preference. It is a harness choice.

Their initializer agent expands the user request into a structured feature list. Later agents are prompted to choose a single feature that is not yet done, work on it, verify it carefully, and leave the environment clean.

This matters because one of the most common agent pathologies is trying to do everything at once. That tendency is not corrected by a vague instruction to “be methodical.” It is corrected by a system that makes incremental work the default path.

So I think it is useful to say:

decomposition is scaffolding, not magic

It helps because it gives the system a workable execution grain. Over time, better models may need less of that scaffolding in some domains. But today, for long-running tasks, harness-enforced chunking is often what keeps the whole process from collapsing.
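Harness-enforced chunking can be made concrete with a small sketch. The feature-list format here is illustrative; the point is that the system, not a vague instruction, limits each session to one unfinished feature.

```python
features = [
    {"name": "login",  "done": True},
    {"name": "search", "done": False},
    {"name": "export", "done": False},
]

def next_feature(feature_list):
    """Return the single feature this session is allowed to work on."""
    for feature in feature_list:
        if not feature["done"]:
            return feature
    return None   # everything verified complete

session_scope = next_feature(features)
print(session_scope["name"])  # search — one feature, not the whole app
```

A session that can only see one feature cannot try to one-shot the app, which is exactly the pathology the scaffolding exists to prevent.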

File-Based Communication and Explicit Contracts

One theme both pieces push toward is that file-based communication beats hidden conversational state for serious work.

Instead of:

agent -> message -> agent

use:

agent -> writes file
agent -> later session reads file

That yields several practical benefits:

  • persistence
  • inspectability
  • versioning
  • reproducibility

It also creates something agent systems badly need: explicit contracts.

A good contract says:

  • what the input is
  • what the expected output is
  • what counts as success
  • what must be validated before progress can be marked complete

You can do that for tools. You can do that for evaluators. You can do that for inter-agent handoffs. The point is to move critical assumptions out of vibes and into structure.
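Such a contract can be sketched as a small data structure; names and fields here are illustrative. The key property is that "done" is unreachable unless the contract's own validation passes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HandoffContract:
    input_spec: str                     # what the next component receives
    output_spec: str                    # what it must produce
    success_check: Callable[[str], bool]  # must pass before "done"

    def mark_complete(self, output):
        if not self.success_check(output):
            raise ValueError("contract not satisfied; cannot mark complete")
        return {"status": "done", "output": output}

contract = HandoffContract(
    input_spec="tasks.json entry",
    output_spec="non-empty test report",
    success_check=lambda report: bool(report.strip()),
)
print(contract.mark_complete("3 passed, 0 failed")["status"])  # done
```

An empty report raises instead of quietly marking progress, which is the structural version of "do not let the agent self-certify."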

The Harness Is the Real Intelligence Layer

Once you put these pieces together, the conclusion becomes hard to avoid.

The old paradigm was:

better prompt -> better result

The emerging paradigm is:

better system -> better result

Where “system” includes:

  • harness
  • constraints
  • evaluation
  • memory
  • execution

That is what AutoHarness shows from the research side. A smaller model can outperform a larger one when it synthesizes a strong enough harness or even an executable policy. And that is what Anthropic shows from the engineering side. A frontier model in a raw loop is not enough; the agent needs explicit initialization, incremental task structure, durable artifacts, context resets, and real end-to-end testing.

So when I say “agents are not prompts,” what I mean is simple:

agents are systems

And the intelligence of that system is encoded not just in the model, but in the harness that surrounds it.

Three Immediate Upgrades

If I had to reduce this to three practical rules, they would be these:

  1. Validate before execution. Never run raw LLM output directly when the environment can enforce legality or correctness.

  2. Separate generation from evaluation. Do not let the same loop build, judge, and approve without external checks.

  3. Externalize state. Put durable memory into files, logs, tests, and artifacts instead of hoping the next context window will reconstruct the world correctly.

Closing

I think this is the real engineering shift in agents.

The last phase of the conversation was dominated by prompts: better instructions, better phrasing, better prompting tricks. The next phase is much more like systems engineering. How do you stop invalid actions before they happen? How do you preserve state across sessions? How do you force evaluation to touch the real world? How do you keep an agent from declaring victory too early?

Those are harness questions.

And they matter because the difference between a toy agent and a useful one is often not whether the model sounds intelligent in a single turn. It is whether the surrounding system can keep that intelligence legible, constrained, and recoverable over time.

References

The ideas in this post come from the two sources discussed above; Codex helped me summarize them.

If you'd like to follow what I'm learning about AI tools and workflows, you can subscribe here → Subscribe to my notes