The fastest way to make your agent better is almost never a smarter model. It is a better-designed environment for the model you already have. An agent's capability is bounded by the action space you expose and the feedback you return, not by the raw reasoning of the policy running inside it. Give a frontier model ambiguous tools, irreversible actions, and opaque errors, and it will loop and stall on a task that a weaker, cheaper model completes cleanly when the tools have crisp contracts and the state is legible.
This is literal, not metaphor. In reinforcement-learning terms an agent is a policy: a function from observed state to action. Model quality is the quality of that function. But the score any policy achieves is a joint product of the policy and the environment: the observation space, the action space, the transition dynamics, and the signal it gets back. You can hold the policy fixed and move realized performance across nearly the whole range by changing the environment alone. Most teams pour their effort into the policy — prompt tweaks, model swaps, a fresh system message every Friday — and leave the environment roughly where the SDK defaults dropped it. They have the leverage backwards.
The arithmetic that should reorganize your priorities
Agentic tasks are sequential. The agent takes a step, observes a result, takes another. That structure means per-step reliability compounds multiplicatively, and multiplication is unforgiving.
Take a task that requires twenty tool calls to complete, a modest number for anything real: reconciling an order, migrating a config, working through a support ticket. Suppose your agent gets each step right 95% of the time. The probability it completes the whole chain without a fatal misstep is 0.95^20, which is about 0.36. Roughly two runs in three fail somewhere. Now improve per-step reliability to 99%. The same twenty-step task now succeeds 0.99^20 of the time, about 0.82. You did not touch the model. You moved end-to-end success from 36% to 82% by cutting the per-step error rate from one-in-twenty to one-in-a-hundred.
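The arithmetic above is easy to check for yourself. Here is a minimal sketch, assuming each step succeeds or fails independently of the others (a simplification; real chains have correlated failures and partial recoveries). The function names `chain_success` and `required_per_step` are illustrative, not from any library.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent, which is a simplification.
    """
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit an end-to-end target."""
    return target ** (1.0 / steps)


# The two scenarios from the text: a twenty-step task at 95% and 99%.
for p in (0.95, 0.99):
    print(f"per-step {p:.0%} -> 20-step success {chain_success(p, 20):.0%}")

# Working backwards: what per-step rate does a 90% end-to-end target imply?
print(f"per-step needed for 90% over 20 steps: {required_per_step(0.90, 20):.4f}")
```

Run as written, this prints 36% and 82% for the two scenarios, and shows that a 90% end-to-end target over twenty steps needs roughly 99.5% per step. The inverse direction is the useful one for planning: it tells you how reliable each tool interaction has to be before the chain is worth shipping.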
Here is the part that matters for where you spend your week. Swapping to a marginally smarter model might lift per-step reliability from 95% to 96%, which moves the twenty-step task from about 36% to about 44%. Redesigning the tool the agent keeps tripping over (making its contract unambiguous, its errors actionable, its effect reversible) routinely takes a single step from 95% to 99%+, because most step failures are not reasoning failures. They are the agent doing something locally sensible against a tool that was underspecified, or misreading a state it could not fully see. This is the compounding-error problem in its plainest form: in a long chain, the per-step error rate is the whole game, and per-step error rate is overwhelmingly a property of the environment, not the model.
You don't fix a call center by hiring smarter operators
I learned this discipline before I ever wired up a tool schema, building Kommerce, a commerce operating system for cash-on-delivery markets, where every order is confirmed by a human operator on a phone call before a courier is dispatched. Cash-on-delivery is a trust-scarce environment: no card captured, no money committed until a stranger hands cash to another stranger at a door. The confirmation call is the load-bearing step, and a bad one burns a real delivery attempt in the physical world.
The naive fix for a call center that makes mistakes is to hire better operators. It does not work, and it does not scale. You cannot recruit your way to reliability when the labor pool is ordinary people, turnover is high, and the task runs thousands of times a day. What works is redesigning the environment so an ordinary operator cannot easily fail. You constrain the action space to the few moves that are valid at this point in the call. You put the exact next-step prompt on the screen so there is no memory or judgment tax. You make steps reversible — a mis-click reschedules rather than cancels, a wrong status can be walked back — so no single error is catastrophic. You surface the order's full state in one legible view so the operator is never reasoning about facts they cannot see.
The operator's "IQ" barely moves the aggregate error rate. The design of their environment moves it by an order of magnitude. This is task-based labor economics applied at the desk: decompose the job into steps, then engineer each step so the marginal operator succeeds by default. An LLM agent is exactly this kind of worker: capable, non-deterministic, occasionally overconfident, operating one step at a time against tools you built. The discipline that makes a call center reliable is the discipline that makes an agent reliable, and it is not a prompt.
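Translated to an agent, "constrain the action space to the few moves that are valid at this point" can be as simple as a transition table the tool layer consults before anything executes. The sketch below is hypothetical: the states, action names, and the `TRANSITIONS` table are invented for illustration and are not Kommerce's actual workflow.

```python
from enum import Enum


class CallState(Enum):
    DIALING = "dialing"
    CONNECTED = "connected"
    CONFIRMED = "confirmed"
    RESCHEDULED = "rescheduled"


# Hypothetical workflow: for each state, the only actions on offer and
# where each one leads. There is deliberately no destructive "cancel"
# move, and a confirmation can be walked back.
TRANSITIONS = {
    CallState.DIALING: {
        "mark_connected": CallState.CONNECTED,
        "reschedule": CallState.RESCHEDULED,
    },
    CallState.CONNECTED: {
        "confirm": CallState.CONFIRMED,
        "reschedule": CallState.RESCHEDULED,
    },
    CallState.CONFIRMED: {"undo_confirm": CallState.CONNECTED},
    CallState.RESCHEDULED: {"redial": CallState.DIALING},
}


def valid_actions(state: CallState) -> list[str]:
    """The moves to surface to the operator (or agent) right now."""
    return sorted(TRANSITIONS[state])


def apply_action(state: CallState, action: str) -> CallState:
    """Apply an action, rejecting anything not valid in this state."""
    moves = TRANSITIONS[state]
    if action not in moves:
        raise ValueError(
            f"'{action}' is not valid in state '{state.value}'; "
            f"valid actions: {valid_actions(state)}"
        )
    return moves[action]


state = apply_action(CallState.CONNECTED, "confirm")
print(state.value, valid_actions(state))
```

The design choice worth noticing is that the worker never has to remember what is legal. The table answers that question, and the rejection message names the legal moves, so a wrong attempt costs one step rather than the whole chain.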
Legible state, or the agent is reasoning about a ghost
An agent can only reason over state it can observe and trust. This sounds obvious and is violated constantly. Teams hand an agent a tool that returns {"status": "ok"} and then wonder why it can't recover when something downstream is subtly wrong. The agent is not stupid. It is blind. It acted on the only observation you gave it, and that observation did not contain the fact it needed.
Two failure modes hide here, and they are distinct. The first is unobservable state: the relevant fact exists in your system but the tool never returns it, so the agent is guessing. The second is untrustworthy state: the tool returns something, but the field is ambiguously named, inconsistently typed, or means different things in different responses, so the agent cannot rely on it. Both collapse the agent's effective observation space, and a policy cannot condition on what it cannot see or cannot trust.
This is why tool design is really schema design, and why "your schema is your strategy" applies with full force. The schema you expose to an agent is its observation space and its action space. A field that is present, correctly typed, and unambiguously named is a fact the agent can plan against. A field that is absent, or a status enum that overloads "pending" to mean four different things, is a hole in the agent's model of the world. When you design the return shape of a tool, you are not formatting data. You are deciding what the agent is allowed to know and what moves it is allowed to make. Ambiguity in the schema propagates directly into the policy's decisions, and it does so silently, because a good model will produce a plausible-looking action even when it is operating on a corrupted picture.
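To make the contrast with a bare {"status": "ok"} concrete, here is one way a tool's return shape might be typed so that every field is a fact the agent can plan against. This is a sketch under assumptions: `OrderStatus`, `OrderView`, `get_order`, and the specific fields are hypothetical, and the lookup is stubbed rather than reading a real system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class OrderStatus(Enum):
    # One meaning per value, instead of an overloaded "pending".
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    CONFIRMED = "confirmed"
    OUT_FOR_DELIVERY = "out_for_delivery"
    DELIVERED = "delivered"


@dataclass(frozen=True)
class OrderView:
    order_id: str
    status: OrderStatus
    quantity: int
    observed_at: str                  # ISO 8601 UTC time the state was read
    allowed_actions: tuple[str, ...]  # moves that are valid from this state


def get_order(order_id: str) -> dict:
    """Return the order's state as a plain, JSON-friendly dict.

    The values are hard-coded stand-ins for a real, uncached lookup.
    """
    view = OrderView(
        order_id=order_id,
        status=OrderStatus.AWAITING_CONFIRMATION,
        quantity=2,
        observed_at=datetime.now(timezone.utc).isoformat(),
        allowed_actions=("confirm_order", "reschedule_call"),
    )
    return {
        "order_id": view.order_id,
        "status": view.status.value,
        "quantity": view.quantity,
        "observed_at": view.observed_at,
        "allowed_actions": list(view.allowed_actions),
    }


print(get_order("ord_1"))
```

The `observed_at` timestamp addresses trust (the agent can tell how fresh the observation is), and `allowed_actions` folds part of the action space into the observation, so the agent is not left inferring which moves are legal.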
Most "the agent isn't good enough" complaints are misdiagnosed
When a team tells me their agent is unreliable and the fix must be a better model, I ask to watch the transcripts. The pattern is nearly always the same, and it is nearly always the environment.
The agent calls a tool with reasonable arguments and gets back "Error: invalid request" with no indication of what was invalid or how to fix it, so it retries the identical call, hallucinates a fix, or gives up. It performs an action that is not idempotent; the first attempt actually succeeded but the response timed out; it retries, and now there are two orders. It reads a status field that is stale because the tool returns cached state, and it confidently builds three more steps on a false premise. It faces a tool named "update" that sometimes creates and sometimes overwrites depending on a flag whose consequences it cannot see, so it makes the wrong bet. None of these are reasoning failures a smarter model reliably escapes. They are contract, observability, and reversibility failures. The model is doing competent inference over a broken interface.
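The duplicate-order failure is worth seeing in miniature, because the fix lives entirely in the tool. The sketch below uses a client-supplied idempotency key so that a retry replays the first result instead of writing again. `OrderService` is a hypothetical in-memory stand-in, not a real backend or library API; a production version would persist the keys and would likely also check that a reused key arrives with the same payload.

```python
import uuid


class OrderService:
    """Hypothetical in-memory order backend, for illustration only."""

    def __init__(self) -> None:
        self.orders: list[dict] = []
        self._by_key: dict[str, dict] = {}

    def create_order(self, idempotency_key: str, sku: str, quantity: int) -> dict:
        # A key we have seen before means this is a retry: return the
        # original result rather than creating a second order.
        if idempotency_key in self._by_key:
            return {**self._by_key[idempotency_key], "replayed": True}

        order = {
            "order_id": f"ord_{len(self.orders) + 1}",
            "sku": sku,
            "quantity": quantity,
        }
        self.orders.append(order)
        self._by_key[idempotency_key] = order
        return {**order, "replayed": False}


service = OrderService()
key = str(uuid.uuid4())  # generated once per intended order, reused on retry

first = service.create_order(key, "sku_123", 2)
retry = service.create_order(key, "sku_123", 2)  # e.g. after a timeout

print(first["order_id"], retry["order_id"], len(service.orders))
```

Both calls return the same order ID and only one order exists. The `replayed` flag is an optional extra: it tells the agent that its earlier attempt did land, which is itself a useful observation.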
The tell is diagnostic. If a careful senior engineer, handed only the tool signatures and error strings your agent sees — no source code, no tribal knowledge — could not reliably complete the task either, then the model is not your bottleneck. Your environment is illegible, and you are asking the model to compensate for it with raw intelligence. Sometimes a frontier model does brute-force through a bad environment, which is the cruelest outcome, because it convinces you the fix is always "wait for the next model" rather than "fix the tool." That is a treadmill. Environment design is a durable asset that pays off under every model you will ever run.
The tool-design checklist
Before you reach for a bigger model, audit every tool your agent can call against five properties. Each one directly lowers the per-step error rate the compounding math punishes.
- Clear contracts. A tool does exactly one thing, with a name that states it, typed arguments with no hidden modes, and a documented, predictable return shape. If a human reader cannot tell from the signature what the tool does and returns, neither can the agent. Split overloaded tools: a create_order and an update_order beat one save_order whose behavior depends on an implicit flag.
- Reversibility and idempotency. Prefer actions that can be undone or safely retried. Make writes idempotent with a client-supplied key so a retry after a timeout does not double-charge or double-ship. Where an action is genuinely irreversible, gate it behind an explicit confirmation step and a dry-run mode. Reversibility converts a catastrophic error into a recoverable one, which is what turns a fragile chain into a robust one.
- Actionable errors. An error must tell the agent what went wrong and what to do next, in the response itself. "Error: invalid request" is useless. "Error: field 'quantity' must be a positive integer; received -1. Retry with a corrected value." lets the policy self-correct in one step instead of looping. Write error messages as if the reader is a competent agent with no access to your docs, because that is precisely the reader.
- Observability. Return the state the agent needs to plan, not the minimum your API happened to expose. If the next decision depends on the order's current status, the tool that touches the order should return that status, fresh, not cached. Make the observation space a superset of what a correct policy needs. Illegible state forces guessing, and guessing is where chains die.
- Least surprise. Tools should behave the way their name and signature imply, consistently, every time. No hidden side effects, no fields that change meaning by context, no silent truncation. Surprise is the enemy of a policy that has learned your interface, because the agent generalizes from what a tool appears to do.
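As one illustration of the checklist, here is what the actionable-error item might look like inside a tool. This is a sketch, not a standard: `update_order_quantity` and the error fields (`code`, `field`, `message`, `retryable`, `next_step`) are hypothetical names, and the success branch is stubbed.

```python
def update_order_quantity(order_id: str, quantity: object) -> dict:
    """Validate the argument and explain any failure in the response itself."""
    # bool is a subclass of int in Python, so reject it explicitly.
    is_int = isinstance(quantity, int) and not isinstance(quantity, bool)
    if not is_int or quantity <= 0:
        return {
            "ok": False,
            "error": {
                "code": "invalid_quantity",
                "field": "quantity",
                "message": (
                    "field 'quantity' must be a positive integer; "
                    f"received {quantity!r}."
                ),
                "retryable": True,
                "next_step": "Retry with a corrected value.",
            },
        }
    # Stand-in for the real write; echo back the state the agent now needs.
    return {"ok": True, "order_id": order_id, "quantity": quantity}


print(update_order_quantity("ord_1", -1)["error"]["message"])
print(update_order_quantity("ord_1", 3))
```

The failure response names the field, states the rule, echoes the offending value, and says whether a retry makes sense. That is everything the agent needs to correct itself in one step without access to your docs.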
None of this requires a new model. All of it requires treating your tool surface as a designed product with the agent as its user, held to the same least-surprise standard you would demand of any API a human team had to build against under deadline.
The teams that win the agent race will not be the ones with privileged access to a model half a generation ahead. That gap closes every few months, and it closes for everyone. The durable advantage is an environment where an ordinary policy cannot easily fail: constrained actions, reversible steps, errors that teach, state you can trust. That is not a prompt you can copy. It is engineering, and it is yours. Stop trying to hire a smarter operator. Redesign the desk.