
The Compounding-Error Problem: Why Agent Reliability Decays Exponentially with Task Length

The binding constraint on autonomous agents isn't intelligence — it's that per-step success probabilities multiply. A 95%-reliable agent finishes a 20-step task 36% of the time. The fix is topology, not IQ.

By Mehdi · 10 min read

The reason your agent nails the demo and falls apart in production is not that the model got dumber between the two. It is that reliability decays exponentially with the number of steps, and nobody in the room did the arithmetic. An agent that is right 95% of the time on any single step will complete a 20-step task correctly about 36% of the time. Not 95%. Not 90%. Thirty-six percent.

That number is just 0.95 raised to the 20th power. If each step is an independent gate the agent has to pass, and it passes each with probability 0.95, then it passes all twenty with probability 0.95²⁰ ≈ 0.358. The chain is only as strong as the product of its links, and products of numbers less than one shrink fast.
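As a quick check on that arithmetic, here is a minimal Python sketch. The function name `chain_success` is my own label, not a library API, and it assumes exactly what the paragraph above assumes: independent steps with no recovery.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # The headline case: 95% per step over a 20-step task.
    print(f"{chain_success(0.95, 20):.3f}")  # prints 0.358
```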

This single fact — not context windows, not reasoning traces, not tool-use finesse — is the binding constraint on autonomous agents today. Once you internalize the multiplication, most of the agent-hype discourse resolves itself, and so does the question of where to actually spend engineering effort.

The multiplication is the whole story

Let me make the decay concrete, because the shape of the curve is the argument.

Take a genuinely strong per-step reliability, 99%. That is better than most production LLM pipelines achieve on real, uncurated steps. Run it out:

  • 10 steps: 0.99¹⁰ ≈ 0.904 — 90%, fine.
  • 20 steps: 0.99²⁰ ≈ 0.818 — 82%, getting nervous.
  • 50 steps: 0.99⁵⁰ ≈ 0.605 — 61%, a coin flip with a slight edge.
  • 100 steps: 0.99¹⁰⁰ ≈ 0.366 — you are now failing most of the time with a 99%-reliable agent.
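The same formula can be inverted to ask how long a chain a given per-step reliability can carry before end-to-end success drops below a target. The sketch below is illustrative only: `max_steps` is a hypothetical helper name, and it keeps the independence and no-recovery assumptions of the list above.

```python
import math


def max_steps(p: float, target: float) -> int:
    """Largest n such that p**n >= target, for independent steps with no recovery."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # p**n >= target  <=>  n <= ln(target) / ln(p), because ln(p) is negative.
    n = math.floor(math.log(target) / math.log(p))
    # Guard against floating-point error right at the boundary.
    while p ** (n + 1) >= target:
        n += 1
    while n > 0 and p ** n < target:
        n -= 1
    return n


if __name__ == "__main__":
    for p in (0.95, 0.99):
        print(f"p={p}: at most {max_steps(p, 0.5)} steps before success falls below 50%")
```

Under these assumptions a 99%-reliable step supports 68 chained steps before the task becomes a losing bet, and a 95%-reliable step supports 13.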

Sit with the 50-step line. Ninety-nine percent per step feels like a solved problem. You would ship a component that works 99 times out of 100. But string fifty of them together with no recovery and the composite system fails almost 40% of the time. The individual pieces are excellent and the whole is unreliable, and both statements are true simultaneously. That gap is where founders lose months, because they keep tuning the pieces.

Now hold the horizon fixed at 20 steps and vary per-step reliability, because this is where the "just wait for a smarter model" thesis goes to die:

  • 90% per step: 0.90²⁰ ≈ 0.122 — 12%.
  • 95% per step: 0.95²⁰ ≈ 0.358 — 36%.
  • 97% per step: 0.97²⁰ ≈ 0.544 — 54%.
  • 99% per step: 0.99²⁰ ≈ 0.818 — 82%.
  • 99.9% per step: 0.999²⁰ ≈ 0.980 — 98%.

Read the middle of that table carefully. Going from 95% to 97% per step is, locally, a two-percentage-point improvement, the kind of thing a model release note buries in a benchmark table. End to end, it moves your task from 36% to 54%: roughly a 50% relative improvement in shipped reliability from a marginal per-step gain. This is why "the model got smarter" sometimes produces a discontinuous jump in agent usefulness and sometimes produces nothing you can feel: near the top of the curve, small per-step gains are enormously amplified by the exponent, and far from the top, they are swamped by the chain length. The derivative of pⁿ with respect to p is n·pⁿ⁻¹, so a given per-step improvement buys more end-to-end reliability the higher the baseline, and in relative terms (the derivative divided by pⁿ is n/p) it buys more the longer the chain. The leverage is real, but it gets expensive exactly where you need it, because buying reliability from 99% to 99.9% is a different order of engineering than buying it from 90% to 95%.
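To put numbers on that sensitivity, here is a small sketch. The helper names (`sensitivity`, `required_per_step`) are hypothetical, and the model is still the independent, no-recovery chain.

```python
def chain_success(p: float, n: int) -> float:
    """End-to-end success of n independent steps, each succeeding with probability p."""
    return p ** n


def sensitivity(p: float, n: int) -> float:
    """Derivative of p**n with respect to p, i.e. n * p**(n - 1)."""
    return n * p ** (n - 1)


def required_per_step(target: float, n: int) -> float:
    """Per-step reliability needed to hit an end-to-end target over n steps."""
    return target ** (1.0 / n)


if __name__ == "__main__":
    gain = chain_success(0.97, 20) - chain_success(0.95, 20)
    print(f"95% -> 97% per step over 20 steps adds {gain:.3f} end to end")
    print(f"local slope at p=0.95, n=20: {sensitivity(0.95, 20):.2f}")
    print(f"per-step reliability needed for 90% over 20 steps: {required_per_step(0.90, 20):.4f}")
```

The last line is the uncomfortable one: a 90% end-to-end target over 20 unguarded steps needs roughly 99.5% per step.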

Why demos dazzle and production disappoints

The demo is not lying to you, exactly. It is running a short, curated chain. A demo is three-to-five steps, hand-picked so each step is a high-probability gate, on inputs the builder has already seen the agent handle. Call it five steps at 98% — that is 0.98⁵ ≈ 0.90. Nine times out of ten it works, and the tenth time you re-run it before anyone is watching. The demo is a survivorship-filtered sample of a high-reliability regime.

Production is a 40-step chain over inputs nobody curated, where step 23 hits a malformed address, an ambiguous instruction, a tool that returns a slightly different schema than last week. Each of those novel steps quietly drops per-step reliability from 98% to, say, 93%, and the exponent does the rest. The agent didn't get worse. The chain got longer and the inputs got wilder, and the multiplication was always going to punish both. The gap between demo and production is not a gap in intelligence. It is the difference between 0.98⁵ and 0.93⁴⁰: ninety percent against five and a half.
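Real chains rarely have one uniform per-step reliability, so the comparison can be sketched with a product over a list of step probabilities. `chain_success_mixed` is a hypothetical name, and the last scenario (35 routine steps plus 5 weak ones at 70%) uses numbers I made up purely for illustration.

```python
import math
from typing import Sequence


def chain_success_mixed(step_probs: Sequence[float]) -> float:
    """End-to-end success of independent steps with differing reliabilities."""
    if any(not 0.0 <= p <= 1.0 for p in step_probs):
        raise ValueError("every step probability must be in [0, 1]")
    return math.prod(step_probs)


if __name__ == "__main__":
    demo = [0.98] * 5          # short, curated chain from the text
    production = [0.93] * 40   # long, uncurated chain from the text
    # Illustrative assumption: mostly routine steps with a handful of weak ones.
    uneven = [0.98] * 35 + [0.70] * 5

    for name, chain in (("demo", demo), ("production", production), ("uneven", uneven)):
        print(f"{name:>10}: {chain_success_mixed(chain):.3f}")
```

The uneven case is worth running: a few weak steps can dominate the product even when most of the chain is demo-quality.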

That is the entire "agents aren't ready" debate compressed into two exponents. And it explains why the debate is so confused: the optimists are extrapolating from the short-chain regime and the skeptics are living in the long-chain one, and both are looking at the same model.

The two escapes

There are exactly two ways out, and only one of them is affordable.

The first is to push per-step reliability toward 1. This works — 99.9% per step gets you to 98% over 20 steps — but the cost of reliability is convex. Each nine you add (99% → 99.9% → 99.99%) costs more than the last, because you are now fighting the long tail of weird inputs and the model's own irreducible error floor. You can spend your whole runway buying nines and still be one bad step away from a failed task. This is the strategy that treats the problem as a capability problem, and it is the one most teams default to because it feels like the ambitious choice. It is mostly a way to convert money into asymptotes.

The second escape is to change the topology of the task, and it is where the winning architectures live. The insight is small and it is the most important thing in this essay: a verified checkpoint resets the product of probabilities to 1.

Here is the mechanism. The 0.99⁵⁰ ≈ 61% catastrophe assumes error accumulates monotonically across all fifty steps with no recovery: one long, brittle chain. Now insert a verification gate every ten steps: a step that checks whether the work so far is correct, and if it is not, discards the bad attempt and retries the segment. You have turned one 50-step chain into five independent 10-step segments, each guarded by a gate.

Do the arithmetic. Each segment is 0.99¹⁰ ≈ 0.904 reliable on a single attempt. Give each segment one retry when its gate reports failure, for two attempts total. A segment now fails only if both attempts fail, so it succeeds with probability 1 − (1 − 0.904)² = 1 − 0.096² ≈ 0.9908. Five such segments in series: 0.9908⁵ ≈ 0.955.
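The same calculation as a sketch, so the segment length and retry budget can be varied. `segmented_success` is a hypothetical helper, and it bakes in three assumptions: the gate is a perfect verifier, attempts are independent, and a failed attempt leaves nothing behind that needs undoing.

```python
def segmented_success(p: float, total_steps: int, segment_len: int, attempts: int) -> float:
    """End-to-end success when a perfect gate follows every segment_len steps.

    Each segment may be attempted up to `attempts` times, and attempts are
    assumed independent of one another.
    """
    if total_steps % segment_len != 0:
        raise ValueError("total_steps must be a multiple of segment_len")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    single_attempt = p ** segment_len
    with_retries = 1.0 - (1.0 - single_attempt) ** attempts
    return with_retries ** (total_steps // segment_len)


if __name__ == "__main__":
    print(f"one unguarded chain:    {segmented_success(0.99, 50, 50, 1):.3f}")
    print(f"gate every 10, 1 retry: {segmented_success(0.99, 50, 10, 2):.3f}")
    print(f"gate every 5, 1 retry:  {segmented_success(0.99, 50, 5, 2):.3f}")
```

Carrying full precision instead of the rounded 0.904 gives about 0.9551, which agrees with the hand calculation to three decimals.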

You just moved the same task, with the same per-step reliability, from 61% to 95.5% — not by making the model smarter by a single point, but by rearranging where the errors are allowed to accumulate and giving the system permission to try again. The checkpoint is doing what raising per-step reliability from 99% to 99.9% would have cost you a quarter to achieve, and it is doing it with an if-statement and a retry.

That is the real design principle: winning agent architectures treat error as inevitable and spend their engineering on detection and recovery, not on raw capability. Short horizons between gates. Verifiable checkpoints that catch a bad state before it propagates. Retries on the segments that failed for recoverable reasons. Each verified gate is a firewall that stops the product-of-probabilities from compounding past it. You are not preventing errors — you cannot — you are preventing them from traveling.
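One way that control flow might look, as a minimal sketch rather than a reference design. Everything here is hypothetical naming (`run_with_checkpoints`, `SegmentFailed`, the `verify` callback), and it assumes each segment is a pure function of the last verified state, so a failed attempt can simply be thrown away.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")


class SegmentFailed(RuntimeError):
    """Raised when a segment cannot pass its gate within the attempt budget."""


def run_with_checkpoints(
    initial: S,
    segments: Sequence[Callable[[S], S]],
    verify: Callable[[int, S], bool],
    max_attempts: int = 2,
) -> S:
    """Run segments in order, advancing only past states the gate has verified."""
    state = initial  # always the last verified state
    for index, segment in enumerate(segments):
        for _attempt in range(max_attempts):
            # Every attempt restarts from the verified state, never from a failed one.
            candidate = segment(state)
            if verify(index, candidate):
                state = candidate
                break
        else:
            # No attempt passed the gate: stop instead of propagating a bad state.
            raise SegmentFailed(f"segment {index} failed after {max_attempts} attempts")
    return state
```

The design choice that matters is that `state` is only ever overwritten by a candidate the gate accepted, so an error in one segment cannot travel into the next.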

The boundary conditions, stated honestly

A sharp engineer is already objecting, and the objections are correct, so let me take them head-on before they cost you trust.

The naive multiplication assumes independence and no recovery. Both assumptions are usually false, and they fail in opposite directions.

Independence fails because errors are correlated. A task that is genuinely hard — an ambiguous spec, a truly novel input — tends to fail every attempt, not a fresh 9.6% of them. When failures are correlated, retries buy you much less than the clean 1 − (1 − p)² formula suggests, because the second attempt is drawing from the same poisoned well as the first. This is the crucial caveat on the checkpoint math: retries only rescue transient, stochastic failures, not systematic ones. If the agent fails a step because the model fundamentally cannot do it, you can retry until you are bankrupt. So the real engineering is triage — separating the recoverable failures (a flaky tool call, a formatting slip, a one-off hallucination) from the systematic ones (the task is out of distribution), and routing only the former into a retry loop. A retry loop that cannot tell the difference just burns tokens on impossible steps.
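A toy model and a sketch of the triage idea follow. Both are my own illustration: the model assumes some fraction of cases fail systematically on every attempt while the rest fail independently, and the retry wrapper assumes the step already classifies its own failures by raising `TransientError` or `SystematicError` (hypothetical exception names). Doing that classification well is the hard part, and this sketch does not solve it.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_success(hard_fraction: float, transient_ok: float, attempts: int) -> float:
    """Success rate when hard cases always fail and the rest fail independently."""
    return (1.0 - hard_fraction) * (1.0 - (1.0 - transient_ok) ** attempts)


class TransientError(Exception):
    """A failure worth retrying, e.g. a flaky tool call or a formatting slip."""


class SystematicError(Exception):
    """A failure that retrying will not fix, e.g. an out-of-distribution task."""


def run_step_with_triage(step: Callable[[], T], max_attempts: int = 3) -> T:
    """Retry only transient failures; systematic ones propagate on first sight."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[TransientError] = None
    for _ in range(max_attempts):
        try:
            return step()
        except TransientError as exc:
            last_error = exc  # recoverable: spend another attempt
        # SystematicError is deliberately not caught, so no tokens are burned on it.
    assert last_error is not None
    raise last_error
```

In the toy model, no number of retries lifts success above `1 - hard_fraction`: with 5% systematic cases, even an unlimited retry budget tops out at 95%.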

Recovery also requires idempotency, and this is where naive retry architectures corrupt state. If step 12 sent an email or charged a card, "retry the segment" is not a reset — it is a double-send. I spent enough time at Kommerce building the state machine behind a cash-on-delivery order to know that "just re-run it" is a sentence written by someone who has never had to reconcile a double-charge. A checkpoint can only reset the probability product if the work behind it can be safely re-run or cleanly rolled back. Building idempotent, replayable steps is unglamorous plumbing, and it is the actual precondition for the topology escape to work at all. Most of the difficulty of production agents lives here, not in the prompt.
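A common way to make a side-effecting step safe to replay is an idempotency key: record that the effect happened, and skip it on a re-run. The sketch below is a deliberately simplified in-memory version with hypothetical names (`EffectLedger`, `run_once`). A production version would need durable storage and would have to handle a crash between performing the effect and recording it, neither of which this handles.

```python
from typing import Any, Callable, Dict


class EffectLedger:
    """In-memory record of side effects already performed, by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> Any:
        """Perform `effect` the first time `key` is seen; replay the stored result after."""
        if key in self._results:
            return self._results[key]  # segment was retried: do not repeat the effect
        result = effect()
        self._results[key] = result
        return result


if __name__ == "__main__":
    sent = []
    ledger = EffectLedger()

    def send_confirmation() -> str:
        sent.append("confirmation email")  # stands in for a real side effect
        return "sent"

    # The key is a made-up example; it must identify the effect, not the attempt.
    for _ in range(2):  # second pass simulates "retry the segment"
        ledger.run_once("order-42:confirmation-email", send_confirmation)
    print(len(sent))  # prints 1
```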

And correlation cuts the other way too, in your favor: on an easy task, the steps are correlated toward success, and the composite reliability is far better than the pessimistic product implies. The multiplication is a pessimistic baseline for independent, unrecoverable chains. Real systems live somewhere between that baseline and the correlated-success best case, which is exactly why measuring where your system actually sits matters more than the toy number.
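A toy two-population model makes that effect visible. The split into "easy" and "hard" tasks and every number below are assumptions chosen so the average per-step reliability comes out at 99%; the function names are hypothetical.

```python
def mixed_population_success(easy_fraction: float, p_easy: float, p_hard: float, n: int) -> float:
    """Average end-to-end success when each task is entirely easy or entirely hard."""
    return easy_fraction * p_easy ** n + (1.0 - easy_fraction) * p_hard ** n


def naive_product(easy_fraction: float, p_easy: float, p_hard: float, n: int) -> float:
    """The same average per-step reliability, treated as independent across steps."""
    p_avg = easy_fraction * p_easy + (1.0 - easy_fraction) * p_hard
    return p_avg ** n


if __name__ == "__main__":
    # Assumed mix: 80% easy tasks at 99.9% per step, 20% hard tasks at 95.4% per step.
    args = (0.8, 0.999, 0.954, 50)
    print(f"naive product:    {naive_product(*args):.3f}")
    print(f"mixed population: {mixed_population_success(*args):.3f}")
```

Because pⁿ is convex in p, the mixed-population figure is never below the naive product for the same average per-step reliability. The flip side is that almost all of the failures concentrate in the hard tasks, which is where retries help least.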

The verifier is the hard part, and it is a specification problem

Notice what the entire topology escape rests on: a gate that can tell whether the work so far is correct. Every claim I just made about resetting the probability product assumes you can build the verifier. If your checkpoint cannot reliably detect a bad state, it is not a firewall — it is a rubber stamp that passes corrupted work downstream with false confidence, which is strictly worse than no gate, because now you trust it.
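The cost of a leaky gate can be sketched with one extra parameter. In this illustrative model, `false_pass_rate` is the chance the gate approves an incorrect attempt, and the gate is assumed never to reject correct work. The names `gate_outcome` and `GateOutcome` are hypothetical.

```python
from typing import NamedTuple


class GateOutcome(NamedTuple):
    correct: float            # segment ends in a correct, accepted state
    silent_corruption: float  # gate accepted bad work, which now flows downstream
    reported_failure: float   # every attempt was caught and rejected


def gate_outcome(segment_ok: float, false_pass_rate: float, attempts: int) -> GateOutcome:
    """Outcome probabilities for one segment behind an imperfect gate."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Probability that one attempt is bad AND the gate catches it, triggering a retry.
    caught = (1.0 - segment_ok) * (1.0 - false_pass_rate)
    # Probability mass of reaching attempt i is caught**i; sum over all attempts.
    reach = sum(caught ** i for i in range(attempts))
    return GateOutcome(
        correct=segment_ok * reach,
        silent_corruption=(1.0 - segment_ok) * false_pass_rate * reach,
        reported_failure=caught ** attempts,
    )


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        out = gate_outcome(0.904, rate, attempts=2)
        print(f"false-pass {rate:.1f}: correct={out.correct:.4f} "
              f"silent={out.silent_corruption:.4f} reported={out.reported_failure:.4f}")
```

With a false-pass rate of 1.0 the gate is the rubber stamp described above: correctness falls back to the single-attempt 0.904, and every failure is silent.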

This is where most agent systems quietly fail, and the failure is a specification problem before it is a modeling one. To verify a step, you must first be able to say precisely what "success" for that step means — and if you cannot specify it, you cannot check it, cannot gate on it, and cannot retry against it. This is the same wall you hit when you try to measure the thing at all, which is why you can't evaluate an agent you can't specify: the eval and the runtime checkpoint are the same artifact wearing different clothes. The discipline of writing a per-step success criterion for your evals is also what gives you a runtime verifier to gate on. Teams that skip the specification work end up with agents they can neither measure offline nor protect at runtime, which are the same failure.
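A sketch of what "the same artifact" could mean in code: one success predicate per step, used both to score recorded outputs offline and to gate live outputs at runtime. All names here (`StepSpec`, `offline_pass_rate`, `runtime_gate`, the `extract_total` step and its output shape) are hypothetical examples, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class StepSpec:
    """A named, executable definition of what success means for one step."""
    name: str
    is_success: Callable[[Any], bool]


def offline_pass_rate(spec: StepSpec, recorded_outputs: Iterable[Any]) -> float:
    """Eval use: fraction of recorded step outputs that meet the spec."""
    outputs = list(recorded_outputs)
    if not outputs:
        raise ValueError("need at least one recorded output")
    return sum(1 for out in outputs if spec.is_success(out)) / len(outputs)


def runtime_gate(spec: StepSpec, output: Any) -> bool:
    """Runtime use: the same predicate decides whether to advance or retry."""
    return spec.is_success(output)


def has_valid_total(output: Any) -> bool:
    """Example criterion: output is a dict carrying a non-negative numeric total."""
    return (
        isinstance(output, dict)
        and isinstance(output.get("total"), (int, float))
        and output["total"] >= 0
    )


EXTRACT_TOTAL = StepSpec(name="extract_total", is_success=has_valid_total)
```

Because both functions call the same `is_success`, tightening the criterion improves the eval and the checkpoint together, and a step with no writable criterion shows up immediately as a step you can neither measure nor gate.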

The clinical version of this is older and more honed than anything in software. A physician working a differential diagnosis does not trust a single reading and march forward; each hypothesis is a checkpoint that must survive confirmation before it is allowed to drive the next decision, and a result that fails to fit is a signal to stop and reset rather than propagate. That verification reflex — assume the current step may be wrong, check before you build on it, and treat a confident-but-unverifiable answer as the dangerous case — is precisely the reflex an agent architecture needs and an LLM lacks by default. It is the same structure I've argued a physician's differential diagnosis teaches about LLM hallucination: the model's failure mode is fluent, confident wrongness, and the only defense is a checkpoint that does not take the model's word for it.

So the reframe is this. Stop asking whether the model is smart enough to do the whole task in one uninterrupted chain. It is not, and at any realistic horizon it will not be, because 0.99ⁿ goes to zero and there is no model release that repeals exponentiation. Ask instead: how short can I make the horizon between gates, how cheaply can I verify each gate, and how many of my failures are recoverable enough to retry? The teams that ship reliable agents are not the ones with the best model. They are the ones who did the multiplication, believed it, and built for the world where errors are certain and only their propagation is optional.

Every step you cannot verify is a step where the exponent is still running.

Frequently asked questions

Doesn't the multiplication model overstate the problem, since agents can retry and self-correct?
Yes, and that's the point of the argument. The naive product of probabilities is the success rate for independent, unrecoverable steps. Retryable, idempotent steps with a working verifier change the math dramatically: a single retry per checkpoint can lift a 50-step task from 61% to ~95%. The model isn't a prediction of doom; it's a diagnosis that tells you retries and checkpoints are where the leverage is.

If errors are correlated, why model them as independent at all?
Independence is the clean baseline that isolates the mechanism. Correlation makes it worse in one way (a genuinely hard step fails every retry, so retries don't rescue it) and better in another (an easy task rarely fails any step). The practical consequence is that you must separate transient failures, which retries fix, from systematic ones, which they don't, and route only the former to a retry loop.

Why does making the model smarter help so little?
Because end-to-end reliability is dominated by chain length, not by marginal per-step gains, until those gains push each step very close to 1. Going from 95% to 97% per step is a 2-point local improvement that moves a 20-step task from 36% to 54%, which is real leverage but still leaves the task failing almost half the time. Each further point of per-step reliability costs more than the last, which is why changing the task's topology usually beats waiting for a better model.

Filed under Applied AI. AI that ships, not AI that demos.
