Enterprise software procurement rests on two assumptions that agents quietly violate: that you can write a specification the product must meet, and that you can run an acceptance test that proves it did. Deterministic software honors both. You spec the invoice-matching logic, you feed it a test suite, it passes or fails, and once it passes it keeps passing. An agent honors neither. It is a probabilistic system whose real failure rate you do not learn from a demo or a pilot; you learn it in production, on your distribution of inputs, weeks after you signed. So the standard model, per-seat licensing, hands the buyer a bill that is fixed and a reliability risk that is not. You pay the same whether it works or not, and the vendor gets paid either way.
That is the whole problem, and outcome-based contracting is the whole answer: pay per resolved ticket, per correct extraction, per completed task. Nothing else realigns the incentive, because nothing else moves the reliability risk onto the only party who can actually reduce it.
Acceptance testing assumes a spec that agents don't have
Procurement is a risk-transfer ritual. You write requirements, the vendor commits to them, you validate against an acceptance test, and at the moment of sign-off the risk of "it doesn't do what we agreed" transfers to the vendor. For deterministic software this works because behavior is a function of the spec. The test suite is finite and the system is fixed, so passing the suite is real evidence about all future runs. The map is the territory.
For a probabilistic system, passing a finite test set tells you almost nothing about the tail. I spend my research life on exactly this failure. An epigenetic clock that predicts age beautifully on its training cohort routinely falls apart on a new one, not because the biology changed but because the new samples came off a different sequencing run, a different lab, a different population: batch effects and distribution shift. The validation number was never a property of the model. It was a property of the model crossed with the cohort you happened to test on. A diagnostic assay with 95% sensitivity in the trial can miss badly in a clinic whose patients don't look like the trial's patients. The reproducibility crisis in biology is, in large part, this one mistake made at scale: treating performance measured on one distribution as if it were an intrinsic guarantee.
An AI agent is the same object. Its accuracy on the vendor's benchmark, or on your two-week pilot, is a measurement on a specific input distribution. Your production traffic is a different distribution: messier tickets, weirder edge cases, the long tail of things nobody scripted. So there is no acceptance test that can discharge the risk at sign-off, because the thing you care about, the failure rate on live traffic, is unknowable until live traffic exists. Procurement's central mechanism, the moment where risk transfers, has no place to stand.
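A toy calculation makes the gap concrete. The sketch below treats overall accuracy as a traffic-weighted average over two hypothetical segments; the segment names, per-segment accuracies, and traffic mixes are illustrative assumptions, not measurements from any real agent.

```python
def blended_accuracy(traffic_mix, accuracy_by_segment):
    """Overall accuracy as the traffic-weighted average of per-segment accuracy."""
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * accuracy_by_segment[segment]
               for segment, share in traffic_mix.items())


# Hypothetical: the same agent, with the same per-segment accuracy, in both settings.
accuracy_by_segment = {"routine": 0.95, "edge_case": 0.50}
pilot_mix = {"routine": 0.90, "edge_case": 0.10}        # curated pilot traffic
production_mix = {"routine": 0.60, "edge_case": 0.40}   # messier live traffic

pilot_accuracy = blended_accuracy(pilot_mix, accuracy_by_segment)
production_accuracy = blended_accuracy(production_mix, accuracy_by_segment)
print(f"pilot: {pilot_accuracy:.3f}, production: {production_accuracy:.3f}")
```

Under these assumed numbers the agent scores about 0.905 in the pilot and about 0.77 in production without the model changing at all; only the input mix moved.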
Per-seat pricing doesn't just ignore this. It inverts it. You buy 200 seats at, say, $40 a month, which is $8,000 monthly and $96,000 a year, and that number is locked before a single real ticket is touched. If the agent resolves 80% of what it attempts, you may have gotten a fair deal, but only by luck: the price was never tied to that 80%. If it resolves 45% and quietly mangles the other 55% into reopened tickets and furious customers, you still owe $96,000, plus the cost of cleaning up the mess, plus the human headcount you couldn't cut because the thing wasn't trustworthy enough to run unsupervised. The buyer holds 100% of the reliability risk and the vendor holds none. The vendor's revenue is a function of seats sold, not outcomes delivered, so the vendor's rational priority is closing seats and renewing them, not grinding down the error rate on your ugliest 10% of cases. Their skin is in your signature, not your success.
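The arithmetic is easy to check. The sketch below uses the seat count, seat price, and the two resolution rates from the paragraph above; the monthly volume of attempted tickets is a hypothetical figure added purely for illustration.

```python
def per_seat_bill(seats, price_per_seat_month):
    """Return (monthly, annual) cost; neither depends on what the agent does."""
    monthly = seats * price_per_seat_month
    return monthly, monthly * 12


def cost_per_resolution(monthly_bill, tickets_attempted, resolution_rate):
    """Effective price paid per ticket the agent actually resolved."""
    resolved = tickets_attempted * resolution_rate
    if resolved == 0:
        return float("inf")  # a fixed bill spread over zero outcomes
    return monthly_bill / resolved


monthly, annual = per_seat_bill(seats=200, price_per_seat_month=40)
print(monthly, annual)

# Hypothetical volume: 10,000 tickets attempted per month.
for rate in (0.80, 0.45):
    print(rate, round(cost_per_resolution(monthly, 10_000, rate), 2))
```

The bill is identical in both scenarios. What floats is the effective price per resolution, and it floats against the buyer: the worse the agent performs, the more each real outcome costs.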
Outcome pricing moves the risk to the party that controls it
Now price the same deployment at $2 per fully resolved ticket. Suppose your org handles 100,000 tickets a month and a fully-loaded human resolution costs you roughly $7. If the agent cleanly resolves 60,000 of them, you pay $120,000 and displace $420,000 of human cost. If it resolves 30,000, you pay $60,000. If it resolves nothing, you pay nothing. The vendor's revenue is now a direct function of the exact quantity the buyer cares about and the vendor previously ignored. Every point of accuracy they claw back on the hard cases is money in their pocket. Every hallucinated resolution that bounces back is money out of it.
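The same numbers as a short sketch. The $2 price and $7 human cost come from the paragraph above; the function name is illustrative, and treating every agent resolution as displacing one full human resolution is a simplifying assumption.

```python
def outcome_month(resolved_tickets, price_per_resolution=2.0, human_cost_per_ticket=7.0):
    """Return (vendor_bill, displaced_human_cost, net_saving) for one month.

    Assumes each agent resolution replaces exactly one human resolution.
    """
    vendor_bill = resolved_tickets * price_per_resolution
    displaced = resolved_tickets * human_cost_per_ticket
    return vendor_bill, displaced, displaced - vendor_bill


for resolved in (60_000, 30_000, 0):
    bill, displaced, saving = outcome_month(resolved)
    print(f"{resolved:>6} resolved: pay ${bill:,.0f}, "
          f"displace ${displaced:,.0f}, net ${saving:,.0f}")
```

The bill and the benefit now move together: there is no scenario in which the buyer pays for resolutions that did not happen.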
This is not a pricing tweak. It relocates the reliability risk to the party with the controls to reduce it: the vendor, who owns the model, the prompts, the tool-calling, the retrieval, the fine-tuning, the escalation logic. The buyer never had those controls and was never in a position to carry that risk; per-seat made them carry it anyway. Outcome pricing forces the vendor to own accuracy, own the exceptions, and own the long tail, because the long tail is now the difference between their margin and their loss. It converts a vendor whose job was to sell you software into a vendor whose job is to make the software work.
The three things that make this hard
If it were easy it would already be the default. It's hard for three specific reasons, and each has a real answer.
Attribution. Did the agent resolve the ticket, or did the human it handed off to? In any human-in-the-loop flow where the agent drafts and the human approves, credit is genuinely ambiguous, and a vendor billing per outcome has every reason to claim the ambiguous ones. This is not a contracts problem dressed up as a technical one. It is a causal-inference problem, and it has the same answer causal inference always gives: a control arm. Route a random slice, say 5%, of eligible tickets to the human-only path and measure the difference in resolution. The gap between the agent-served population and the held-out one is the agent's incremental contribution, cleanly identified, no argument required. I run this exact logic on microbiome data to separate what a microbe causes from what merely correlates with it; the discipline is identical. Bill on measured incrementality, not on self-reported closures.
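A minimal sketch of the holdout logic, under stated assumptions: tickets are assigned by hashing a ticket ID (a deterministic stand-in for random assignment), "resolved" is measured the same way in both arms, and the counts in the example are invented. The function names are hypothetical, not part of any billing system.

```python
import hashlib
import math


def in_control_arm(ticket_id, holdout_pct=5):
    """Deterministically send roughly holdout_pct% of tickets to the human-only path."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_pct


def incremental_lift(agent_resolved, agent_total, control_resolved, control_total):
    """Difference in resolution rate between arms, with its standard error."""
    p_agent = agent_resolved / agent_total
    p_control = control_resolved / control_total
    se = math.sqrt(p_agent * (1 - p_agent) / agent_total
                   + p_control * (1 - p_control) / control_total)
    return p_agent - p_control, se


def billable_resolutions(lift, agent_total):
    """Bill only the resolutions the agent added beyond the control baseline."""
    return max(0.0, lift) * agent_total


# Invented counts: 95,000 agent-served tickets, 5,000 held out.
lift, se = incremental_lift(66_500, 95_000, 2_750, 5_000)
print(f"lift = {lift:.3f} +/- {1.96 * se:.3f}")
print(f"billable = {billable_resolutions(lift, 95_000):,.0f}")
```

The standard error matters commercially: a 5% holdout is small, so the contract should say how much uncertainty in the lift both sides will tolerate before the invoice is considered settled.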
Gaming. The instant you pay per "resolved" ticket, you've created pressure to mark things resolved that aren't. Deflection that just exhausts the customer into giving up scores as a win. The fix is to define the outcome unit so that a fake resolution doesn't pay: a ticket counts only if the customer doesn't reopen or escalate within, say, seven days, and doesn't rate the interaction below threshold. Durability and satisfaction fold into the unit itself. Now a hollow resolution is a liability on the vendor's books, not a line item on yours.
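One way to write that unit down precisely is as a predicate over ticket events. This is a sketch, not a standard schema: the `Ticket` fields, the 1-to-5 rating scale, and the threshold of 3 are assumptions, and the seven-day window is the example figure from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    closed_at: datetime
    reopened_at: Optional[datetime] = None
    escalated_at: Optional[datetime] = None
    csat: Optional[int] = None  # assumed 1-5 scale; None means the customer left no rating


def is_billable(ticket, as_of, window=timedelta(days=7), min_csat=3):
    """A closure pays only if it survived the window and was not rated below threshold."""
    deadline = ticket.closed_at + window
    if as_of < deadline:
        return False  # too early to know whether the resolution held
    for event in (ticket.reopened_at, ticket.escalated_at):
        if event is not None and event <= deadline:
            return False  # bounced back inside the durability window
    if ticket.csat is not None and ticket.csat < min_csat:
        return False  # resolved on paper, but the customer disagreed
    return True


closed = datetime(2024, 1, 1)
print(is_billable(Ticket(closed), as_of=datetime(2024, 1, 10)))
print(is_billable(Ticket(closed, reopened_at=datetime(2024, 1, 3)),
                  as_of=datetime(2024, 1, 10)))
```

Note the first check: nothing is billable until the window has elapsed, which means invoices lag closures by at least the durability period. Unrated tickets count as billable here; whether they should is a contract decision.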
Defining the unit. "Resolved ticket" is clean. "Correct extraction" is cleaner, because you can audit a sample against ground truth and price on measured precision. But plenty of valuable work resists a crisp unit: a research summary, a strategy memo, a judgment call. Where you cannot define an outcome unit you can measure and defend, outcome pricing does not apply, and you should be honest about that rather than force it. The tell that a workflow is ready for outcome contracting is that you can already state, in one sentence, what "done and correct" means and how you'd verify it on a random sample. If you can't, you're not buying an outcome. You're buying an assist, and you should price it as one.
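For the extraction case, "audit a sample and price on measured precision" can be sketched as follows. Pricing on the lower end of a Wilson score interval rather than on the raw sample precision is one conservative option, an assumption of this sketch rather than anything the argument requires; the sample sizes and counts are invented.

```python
import math
import random


def draw_audit_sample(extraction_ids, sample_size, seed):
    """Pick a reproducible random sample of extractions to check against ground truth."""
    rng = random.Random(seed)
    return rng.sample(extraction_ids, min(sample_size, len(extraction_ids)))


def wilson_lower_bound(correct, audited, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if audited == 0:
        return 0.0
    p = correct / audited
    denom = 1 + z * z / audited
    centre = p + z * z / (2 * audited)
    margin = z * math.sqrt(p * (1 - p) / audited + z * z / (4 * audited * audited))
    return (centre - margin) / denom


# Invented audit: 400 extractions sampled from 50,000, 380 found correct.
sample = draw_audit_sample(list(range(50_000)), 400, seed=7)
precision_floor = wilson_lower_bound(correct=380, audited=len(sample))
print(f"sample precision 0.950, priced precision {precision_floor:.3f}")
```

Fixing the seed, or better, having both parties agree on it in advance, keeps either side from redrawing the sample until the number looks right.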
Why the market is heading here anyway
The reason this stays a minority structure today is that vendors can't afford it. When inference is expensive, a vendor pricing per successful outcome is exposed on every failed attempt: they burn tokens on the ticket the agent flubs and collect nothing. So they retreat to per-seat, where the buyer subsidizes the failures. That constraint is dissolving. As I argued in my piece on the inference-cost collapse, the cost per token is falling fast enough to break every pricing model built on the assumption that compute is the scarce input. When a failed attempt costs the vendor a fraction of a cent, they can eat the misses and still profit on the hits, and outcome pricing goes from financially reckless to obviously correct. The vendors who move first will use it as a weapon: "we only bill when it works" is an unanswerable pitch against a competitor charging per seat for a probabilistic black box. Per-seat for autonomous work becomes the mark of a vendor who doesn't believe their own accuracy numbers. That's a forecast, but I'd bet on it.
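The vendor's side of that bet reduces to one line of expected-value arithmetic. In the sketch below every number is hypothetical: a $2 price per outcome, a 60% success rate, and two inference costs per attempt chosen to represent an expensive regime and a cheap one.

```python
def expected_margin_per_attempt(success_rate, price_per_outcome, inference_cost_per_attempt):
    """Revenue arrives only on success; inference cost is paid on every attempt."""
    return success_rate * price_per_outcome - inference_cost_per_attempt


def break_even_success_rate(price_per_outcome, inference_cost_per_attempt):
    """Success rate below which the vendor loses money on each attempt."""
    return inference_cost_per_attempt / price_per_outcome


# Hypothetical regimes: $1.50 per attempt versus half a cent per attempt.
for cost in (1.50, 0.005):
    margin = expected_margin_per_attempt(0.60, 2.00, cost)
    floor = break_even_success_rate(2.00, cost)
    print(f"cost ${cost}: margin ${margin:.3f} per attempt, break-even at {floor:.2%}")
```

In the expensive regime the vendor needs 75% success just to break even and loses money at 60%; in the cheap regime the break-even rate is a rounding error and almost any working agent is profitable on outcome terms.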
What to actually do as a buyer
Three moves, in order of leverage.
First, refuse per-seat for probabilistic, autonomous work. Not as a negotiating posture, as a category rule. If the tool acts on its own and its failure rate is discovered in production, per-seat structurally puts the risk on you, and no discount fixes a structure. Reserve per-seat for deterministic tools and for genuine assists where the human stays fully in the loop and reviews every output. A code-completion sidebar is fine on a seat. A tier-one support agent running unsupervised is not.
Second, define the outcome unit before you talk price, and define it to include durability. Write down exactly what "resolved," "extracted correctly," or "completed" means, the window over which it has to hold, and the sampling method you'll use to audit it. The vendor who flinches at a precise, auditable unit is telling you their accuracy won't survive one. The vendor who leans in is telling you the opposite.
Third, and this is where most outcome deals quietly fail, make the vendor bear the exception tail, not just the happy path. It is easy to write a contract that pays per resolved ticket and stays silent on what happens to the ones the agent can't handle. Then the vendor cherry-picks the easy 80%, bills you for it, and dumps the hard 20% back on your team as if that were free. But the hard 20% is exactly where the cost lives; as I argued in "the last 20% is where agent ROI goes to die," the exception tail is not a rounding error on the automation, it is the economics. Your contract has to price the tail into the vendor's problem: either they resolve it and get paid, or they route it cleanly with full context and eat a penalty for the handoff, but they never get to pretend it doesn't exist. Own the tail or lose the deal.
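A settlement rule with that shape can be sketched in a few lines. The disposition labels, the $2 price, and the $0.50 handoff penalty are all hypothetical; the point is only that every ticket must land in one of the two priced buckets.

```python
from collections import Counter

ALLOWED_DISPOSITIONS = {"resolved", "handed_off"}


def settle(dispositions, price_per_resolution, handoff_penalty):
    """Net amount owed to the vendor: pay for resolutions, charge for handoffs."""
    counts = Counter(dispositions)
    unknown = set(counts) - ALLOWED_DISPOSITIONS
    if unknown:
        # No third bucket: a ticket the vendor silently dropped is a contract breach.
        raise ValueError(f"unpriced dispositions: {sorted(unknown)}")
    return (counts["resolved"] * price_per_resolution
            - counts["handed_off"] * handoff_penalty)


month = ["resolved"] * 80_000 + ["handed_off"] * 20_000
print(settle(month, price_per_resolution=2.00, handoff_penalty=0.50))  # tail priced
print(settle(month, price_per_resolution=2.00, handoff_penalty=0.00))  # tail ignored
```

With a zero penalty the vendor earns the same whether the hard 20% comes back with full context or none, which is exactly the silence the paragraph warns about. Any positive penalty gives the vendor a reason to shrink the handoff pile.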
You are not buying an agent. You cannot spec one, cannot acceptance-test one, cannot take ownership of one the way you take ownership of code that does the same thing every time. What you can buy is a resolved ticket, a correct extraction, a finished task, priced so that the vendor only wins when you do. Everything else is paying full freight for someone else's uncertainty and calling it software.