Applied AI

You Can't Evaluate an Agent You Can't Specify

Enterprise agent pilots stall at "impressive demo, never shipped" because teams score final answers while agents operate on trajectories — path-dependent decision sequences where one demo tells you almost nothing.

By Mehdi · July 4, 2026 · 8 min read

On this page

Why the final answer lies to you
An old discipline wearing new clothes
The playbook
The honest boundary

Most enterprise agent pilots die in the same place: an impressive demo, then nothing ships. The usual explanation — the models aren't good enough yet — is mostly wrong. The real reason is that the teams building these systems have no reliable way to tell a good agent from a lucky one, because they are evaluating at the wrong level. They score the final answer. The agent lives in the sequence of decisions that produced it. Those are not the same object, and treating them as the same is why the pilot never survives contact with production.

Start with what changed when we moved from models to agents. Evaluating a model is, at bottom, a function test. You have an input, an output, and a notion of correctness: classify this image, translate this sentence, answer this question. You build a labeled set, you run it, you get a number. The number is honest because the unit of work — one input, one output — is exactly the unit you scored. There is no hidden state between the prompt and the answer.

An agent breaks that clean correspondence. It decides which tool to call, reads the result, decides what to do next, and repeats, ten or thirty or a hundred times, before it emits anything a human sees. The unit of work is no longer input-to-output. It is a trajectory: a path-dependent sequence of decisions where each step conditions every step after it. And a trajectory has three properties that make final-answer scoring almost useless.

Why the final answer lies to you

The first property is compounding. Errors in a sequence don't add, they multiply. Take an agent that makes twenty sequential decisions and gets each one right 95% of the time. Naively that sounds excellent. But if the steps are roughly independent, end-to-end success is 0.95²⁰ ≈ 0.36. A 95%-reliable step, run twenty times, produces a 36%-reliable agent. Push per-step reliability to 99% and you get 0.99²⁰ ≈ 0.82: better, still nowhere near shippable for anything that touches money or records. Real agents aren't perfectly independent, and a well-built one can sometimes detect and recover from a bad step, which pushes the number back up. But the direction is the point. The cost of an early mistake is not local. An agent that retrieves the wrong customer record on step three can execute steps four through twenty flawlessly and still be catastrophically, confidently wrong. Final-answer scoring sees one failure. The mechanism that produced it — a recoverable early error the agent never caught — is invisible unless you look at the path.
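The arithmetic above is easy to reproduce. The following is a minimal sketch under the same simplifying assumption the paragraph makes (independent steps, no recovery); the function names are illustrative, not from any library.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent decisions are correct."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit `target` end-to-end over `steps` steps."""
    return target ** (1.0 / steps)


for p in (0.95, 0.99):
    print(f"per-step {p:.2f} over 20 steps -> {end_to_end_success(p, 20):.2f}")
# per-step 0.95 over 20 steps -> 0.36
# per-step 0.99 over 20 steps -> 0.82

# Inverting the question: what per-step reliability gives 99% end-to-end?
print(f"{required_per_step(0.99, 20):.4f}")  # 0.9995
```

Running the inversion makes the point sharper: under independence, a 99%-reliable twenty-step agent needs roughly 99.95% reliability at every single step.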

The second property is that multiple valid paths exist. For most non-trivial tasks there is no single correct trajectory. An agent asked to reconcile two datasets might join them directly, or dedupe first, or spot-check a sample before committing, all legitimate. This kills the naive fix of diffing the trajectory against a golden path. There is no golden path. There is a space of acceptable paths and a much larger space of unacceptable ones, and your evaluation has to distinguish the regions, not match a string.

The third property is the one that quietly destroys demos: the same final answer can come from sound reasoning or from unsound reasoning, and the two are indistinguishable at the output. An agent can arrive at the right number by doing the analysis correctly, or by pattern-matching to something it saw in context and getting lucky. On the demo input, both produce a green check. On the next input, the sound one holds and the lucky one collapses. This is a right answer for the wrong reasons, and it is not a rare edge case. It is the default failure mode of a system optimized against a small, visible set of examples.

Which is why "it worked in the demo" carries almost no information. A demo is a single sample from an unknown distribution. After n clean trials, the exact 95% upper bound on your failure rate is 1 - 0.05^(1/n), which the rule of three from applied statistics approximates as 3/n once n is reasonably large. With n = 1, the exact bound is 95%: a single green demo is consistent with a true failure rate anywhere up to 95%. You cannot distinguish a 40%-reliable agent from a 99%-reliable one by watching it work once, and yet a watched-it-work-once demo is precisely the evidence on which most "let's pilot this" decisions get made.
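For readers who want to check the bound: the rule of three is the large-n approximation of an exact expression. After n clean trials, the largest failure rate p still consistent with the data at 95% confidence solves (1 - p)^n = 0.05. A small sketch, with function names chosen here for illustration:

```python
def exact_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Largest failure rate p such that n clean runs still occur with
    probability (1 - confidence), i.e. (1 - p)**n == 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def rule_of_three(n: int) -> float:
    """Large-n approximation of the bound above, capped at 1."""
    return min(1.0, 3.0 / n)


for n in (1, 10, 30, 100, 300):
    print(n, round(exact_upper_bound(n), 4), round(rule_of_three(n), 4))
```

At n = 1 the exact bound is 0.95 while 3/n is capped at 1; the two agree closely from a few dozen trials upward, which is the regime the rule of three is meant for.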

An old discipline wearing new clothes

I spent years in a lab where the entire job was refusing to let a good-looking result count as knowledge. You do not get to publish because the numbers came out the way you hoped. You have to have specified, in advance, what would count as success and what would count as failure, and then rule out the possibility that you got the outcome you wanted for a reason other than the one you're claiming. Agent evaluation is the same discipline, and teams that ship agents on vibes are making the two specific errors that separate science from anecdote.

The first error is scoring what you never specified. You cannot evaluate a system against a success criterion you have not written down as a testable statement, and "it should do the right thing" is not a testable statement. This is not a productivity tip; it is the load-bearing skill, and it is why problem specification outlasts prompt engineering — the models keep changing, but the discipline of stating precisely what "correct" means for this task is what makes any evaluation possible at all. Most teams skip it because at the level of a demo it feels unnecessary. The agent obviously did the right thing; you just watched it. But "obviously" is doing all the work, and "obviously" does not scale to ten thousand inputs you will never personally watch.

The second error is crediting a good outcome to a good process without controlling for luck. When an agent succeeds, the tempting inference is: it succeeded, therefore its reasoning is sound, therefore it will succeed again. That is a causal claim, and it is exactly the kind of claim you are not allowed to make from a single observation without ruling out the confounders. This is the causal-inference problem hiding inside every AI decision: the outcome you observe is the joint product of the agent's competence and the input's difficulty and whatever happened to be in context, and unless you vary those independently you cannot attribute the win to the agent. A right answer on an easy, in-distribution input tells you the agent handles easy, in-distribution inputs. It is silent on everything else.

The playbook

Here is what evaluating at the trajectory level actually requires. None of it is exotic. All of it is skipped.

Write trajectory-level acceptance criteria, not just output checks. For each task type, specify properties the path must satisfy, not only the answer. Did the agent verify the account was authorized before acting on it? Did it check its own output before committing? Did it stay inside the tools it was permitted to use? These are the acceptance tests. The final answer being correct is one criterion among several, and often not the most important one. An agent that reaches the right answer by acting on data it had no authority to touch has failed, whatever the output says.
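As one possible shape for such criteria, here is a minimal sketch that checks a recorded trajectory against the three questions above. Everything in it is hypothetical: the `Step` record, the tool names, and the rule that a self-check must precede each committing action are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical tool names, for illustration only.
PERMITTED_TOOLS = {"verify_account", "lookup_order", "self_check", "issue_refund"}
COMMIT_TOOLS = {"issue_refund"}


@dataclass
class Step:
    tool: str
    account: Optional[str] = None  # account the step touches, if any


def check_trajectory(steps: List[Step]) -> List[str]:
    """Return the violated path properties; an empty list means acceptable."""
    violations: List[str] = []
    verified: Set[str] = set()
    checked_since_commit = False
    for i, step in enumerate(steps):
        # Property 1: stay inside the permitted tools.
        if step.tool not in PERMITTED_TOOLS:
            violations.append(f"step {i}: tool {step.tool!r} is not permitted")
        # Property 2: verify an account before acting on it.
        if step.tool == "verify_account":
            if step.account is not None:
                verified.add(step.account)
        elif step.account is not None and step.account not in verified:
            violations.append(f"step {i}: acted on {step.account!r} before verifying it")
        # Property 3: check own output before committing.
        if step.tool == "self_check":
            checked_since_commit = True
        elif step.tool in COMMIT_TOOLS:
            if not checked_since_commit:
                violations.append(f"step {i}: committed without a self-check")
            checked_since_commit = False
    return violations


good = [Step("verify_account", "A"), Step("lookup_order", "A"),
        Step("self_check"), Step("issue_refund", "A")]
bad = [Step("lookup_order", "A"), Step("issue_refund", "A"), Step("delete_db")]
print(check_trajectory(good))  # []
print(len(check_trajectory(bad)))  # 4
```

Note that the check never looks at the final answer. It can fail a run whose output was correct, which is exactly the case final-answer scoring cannot see.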

Evaluate on a distribution, never on cherry-picked cases. The demo set is the enemy. You need a task set that spans the difficulty range and the input variety the agent will actually meet, including the ugly, out-of-distribution, adversarial cases, because those are where the lucky-reasoning agents fall apart and the sound ones earn their keep. Report the pass rate across the distribution with its uncertainty, not the count of impressive individual runs. One number over a hundred varied tasks is worth more than a hundred anecdotes.
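"With its uncertainty" can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval, which is one reasonable choice among several and is not something the essay prescribes. It also assumes the tasks are independent draws from the distribution you care about.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(90, 100)
print(f"90/100 passed: {low:.2f} to {high:.2f}")

low, high = wilson_interval(1, 1)
print(f"1/1 passed:    {low:.2f} to {high:.2f}")
```

Ninety passes out of a hundred gives an interval of roughly 0.83 to 0.94. One pass out of one gives roughly 0.21 to 1.0, which is the single-demo problem expressed as an interval.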

Separate outcome-correctness from process-soundness, and measure both. These are two different axes and conflating them is the core mistake. An agent can be right for wrong reasons (correct outcome, unsound process, will fail next time) or wrong for right reasons (sound process, bad outcome because the task was genuinely impossible or the data was wrong). You want to promote the second and eliminate the first, and you can do neither if your dashboard collapses both into a single pass/fail.
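A dashboard that keeps the two axes apart can be as simple as a four-way tally. In this sketch each graded run is a pair of booleans; how `process_sound` gets decided (a trajectory checker, a human reviewer) is left open, and the sample data is made up.

```python
from collections import Counter
from typing import Iterable, Tuple


def quadrant(outcome_correct: bool, process_sound: bool) -> str:
    """Name the cell of the outcome-by-process grid a run falls into."""
    if outcome_correct and process_sound:
        return "right for right reasons"
    if outcome_correct:
        return "right for wrong reasons"
    if process_sound:
        return "wrong for right reasons"
    return "wrong for wrong reasons"


def tally(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count runs per quadrant instead of collapsing them to pass/fail."""
    return Counter(quadrant(outcome, process) for outcome, process in results)


# Hypothetical graded runs: (outcome_correct, process_sound)
runs = [(True, True), (True, False), (False, True), (True, True), (False, False)]
counts = tally(runs)
for name, count in counts.items():
    print(f"{name}: {count}")
```

A plain pass/fail view would report three passes out of five here and hide that one of those passes came from an unsound process.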

Build a regression suite from every failure you have ever seen. Every production failure is a specification you didn't know you needed. The moment an agent does something bad, that trajectory — its inputs, its context, the forbidden thing it did — becomes a permanent test case. This is how the eval set stops being a static artifact and becomes a ratchet: it can only get stricter, and every past failure is guaranteed never to ship silently twice.
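One way to picture the ratchet is a suite that only ever grows. In this sketch a trajectory is reduced to an ordered list of tool names, and each recorded failure carries a predicate that returns True if the forbidden behavior recurs. The class names, the stub agents, and the example failure are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Trajectory = List[str]  # ordered tool names; a simplification for this sketch


@dataclass
class RegressionCase:
    name: str
    task_input: str
    forbidden: Callable[[Trajectory], bool]  # True if the bad behavior recurs


@dataclass
class RegressionSuite:
    cases: List[RegressionCase] = field(default_factory=list)

    def add_failure(self, name: str, task_input: str,
                    forbidden: Callable[[Trajectory], bool]) -> None:
        """Record a past failure as a permanent test case. Cases are never removed."""
        self.cases.append(RegressionCase(name, task_input, forbidden))

    def run(self, agent: Callable[[str], Trajectory]) -> List[str]:
        """Return the names of past failures the agent reproduces."""
        return [c.name for c in self.cases if c.forbidden(agent(c.task_input))]


def refund_before_verify(t: Trajectory) -> bool:
    """True if a refund was issued with no account verification before it."""
    return "issue_refund" in t and "verify_account" not in t[: t.index("issue_refund")]


suite = RegressionSuite()
suite.add_failure("refund-without-verify", "refund order 123", refund_before_verify)

# Stub agents standing in for real runs.
old_agent = lambda task: ["lookup_order", "issue_refund"]
new_agent = lambda task: ["verify_account", "lookup_order", "issue_refund"]

print(suite.run(old_agent))  # ['refund-without-verify']
print(suite.run(new_agent))  # []
```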

Measure the cost of failure modes, not just the pass rate. A 90% pass rate is meaningless until you know what the 10% costs. Ten percent of tasks where the agent asks a clarifying question instead of proceeding is a rounding error. Ten percent where it silently issues wrong refunds is a company-ending liability. Building Kommerce, a commerce operating system for markets where trust is the scarce resource and a single mishandled cash-on-delivery order can burn a customer permanently, taught me that failures are not fungible. The pass rate treats them as if they were. Weight your evaluation by the actual downside of each failure mode, because that is the number that decides whether you can ship.
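The weighting can be written down directly as an expected cost per task. The failure modes, rates, and cost figures below are invented for illustration; the real inputs are whatever each failure actually costs your business.

```python
from typing import Dict, Tuple

# failure mode -> (rate per task, cost per occurrence in arbitrary currency units)
FailureProfile = Dict[str, Tuple[float, float]]


def pass_rate(profile: FailureProfile) -> float:
    """Fraction of tasks with no failure, assuming modes are mutually exclusive."""
    return 1.0 - sum(rate for rate, _ in profile.values())


def expected_cost_per_task(profile: FailureProfile) -> float:
    """Sum of rate * cost over all failure modes."""
    return sum(rate * cost for rate, cost in profile.values())


# Two hypothetical agents with the same 90% pass rate.
cautious = {"asks clarifying question": (0.10, 0.5)}
reckless = {"silently issues wrong refund": (0.10, 200.0)}

for name, profile in (("cautious", cautious), ("reckless", reckless)):
    print(name, round(pass_rate(profile), 2), round(expected_cost_per_task(profile), 2))
```

Both agents report the same pass rate, and under these made-up costs their expected cost per task differs by a factor of 400. That gap is the number the pass rate hides.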

The honest boundary

You cannot write a perfect trajectory spec for an open-ended task. If the task is "research this market and write a memo," there is no enumerable set of correct paths, and anyone selling you a complete specification of one is selling you a fiction. This is a real limit, and pretending otherwise just relocates the vibes.

But the achievable version is far stronger than what most teams do, which is nothing. You cannot specify the single right path; you can specify the invariants every acceptable path must preserve and the forbidden states no path may ever enter. The agent may reason its way to the memo however it likes, but it must never cite a source that does not exist, never present a projection as a fact, never act on data outside its authorized scope. Those are testable. They are the guardrails, not the route. And the failures that actually kill agents in production are almost never "chose a suboptimal path." They are "entered a state that should have been impossible." Specify the cannot-happens even when you cannot specify the shoulds, and you have converted most of the risk into something you can measure.
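The cannot-happens translate naturally into named predicates over a record of what the run did. The sketch below is illustrative only: the `RunRecord` fields are hypothetical, and "a source that does not exist" is approximated as "a source the agent never actually retrieved during the run", which is an assumption rather than a complete fabrication check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class RunRecord:
    retrieved_sources: Set[str] = field(default_factory=set)
    cited_sources: Set[str] = field(default_factory=set)
    authorized_scopes: Set[str] = field(default_factory=set)
    accessed_scopes: Set[str] = field(default_factory=set)
    unlabeled_projections: int = 0  # projections stated as if they were facts


# Each invariant returns True when it holds. None of them mentions the route taken.
INVARIANTS: Dict[str, Callable[[RunRecord], bool]] = {
    "cites only retrieved sources": lambda r: r.cited_sources <= r.retrieved_sources,
    "stays inside authorized scope": lambda r: r.accessed_scopes <= r.authorized_scopes,
    "labels every projection": lambda r: r.unlabeled_projections == 0,
}


def violated(record: RunRecord) -> List[str]:
    """Names of invariants the run broke; empty means no forbidden state was entered."""
    return [name for name, holds in INVARIANTS.items() if not holds(record)]


clean = RunRecord({"report_a", "report_b"}, {"report_a"}, {"market_data"}, {"market_data"})
dirty = RunRecord({"report_a"}, {"report_a", "report_z"}, {"market_data"},
                  {"market_data", "customer_pii"}, unlabeled_projections=2)

print(violated(clean))  # []
print(violated(dirty))
```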

The teams stuck at "impressive demo, never shipped" are not there because their agent is weak. They are there because they have no answer to the question the buyer always eventually asks: how do you know it will do this again, on an input you have not seen, for the reason you think? A demo cannot answer that. Only a specification can.

Frequently asked questions

How is agent evaluation actually different from model evaluation?

Model evaluation scores a single input-output pair: given this prompt, was the answer right? Agent evaluation has to score a whole sequence of decisions: which tool the agent called, what it did with the result, how it recovered from a bad step. The same final answer can come from sound reasoning or from luck, and only the trajectory tells you which. That distinction is invisible at the output layer, which is exactly where most teams look.

If perfect trajectory specs are impossible for open-ended tasks, what should teams actually specify?

Specify invariants and forbidden states rather than the single correct path. You usually can't enumerate every valid way to accomplish an open-ended task, but you can state what must always hold (the agent never acts outside the authorized account, never fabricates a citation, never issues a refund above its limit) and what must never happen regardless of the path taken. Those are testable, and they catch the failures that actually cost you money.

Why does a successful demo carry so little information?

A demo is one sample from an unknown distribution. After n clean trials, the exact 95% upper bound on your failure rate is 1 - 0.05^(1/n), which the rule of three from applied statistics approximates as 3/n for large n. With n = 1, a single green run is consistent with a true failure rate anywhere up to 95%. You cannot distinguish a 40%-reliable agent from a 99%-reliable one on a single run, which is why 'it worked in the demo' should barely move your beliefs.

Filed under Applied AI. AI that ships, not AI that demos.
