What actually works when you put machine intelligence into a real product or workflow — evaluation, cost, latency, failure modes, and the org design around it. Written by someone who builds with these systems, not just about them.
The binding constraint on autonomous agents isn't intelligence — it's that per-step success probabilities multiply. A 95%-reliable agent finishes a 20-step task 36% of the time. The fix is topology, not IQ.
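The 36% figure is just compounding. Here is a minimal sketch of the arithmetic, assuming every step succeeds independently with the same probability (a simplification; real agent steps are often correlated). The function names are illustrative and not taken from the essay.

```python
def task_success_probability(per_step: float, steps: int) -> float:
    """Chance that all `steps` succeed, assuming independent, equally reliable steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` end-to-end success over `steps`."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # End-to-end success of a 95%-reliable agent at various task lengths.
    for steps in (1, 5, 10, 20, 50):
        p = task_success_probability(0.95, steps)
        print(f"{steps:>2} steps: {p:.0%}")

    # Per-step reliability needed for a 20-step task to finish 90% of the time.
    print(f"needed per step: {required_per_step(0.90, 20):.2%}")
```

Under the same independence assumption, the second function shows why shortening the chain or adding checkpoints helps more than a marginally better model: a 20-step task needs roughly 99.5% per-step reliability to finish nine times out of ten.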
MAMMAL's real contribution is not a benchmark win. It's a bet that molecules, proteins, and gene expression can share one sequence-to-sequence language — and a 458M-parameter generalist that proves the bet pays.
Enterprise agent pilots stall at "impressive demo, never shipped" because teams score final answers while agents operate on trajectories — path-dependent decision sequences where one demo tells you almost nothing.
The limit on agent autonomy isn't capability; it's accountability. Every high-trust role is built around liability, and an AI bears no consequences for being wrong, so a human stays on the hook permanently.
The consequential shift isn't agents running your errands; it's agents transacting with other agents. That needs identity, binding commitment, and settlement primitives the web never built, and it opens an adversarial surface the web has never faced.
Today's agents are amnesiacs that re-solve your problem from scratch every session. The next advance isn't a smarter model but persistent, structured memory, and the accumulated record of working with you is where the real moat forms.
As agents act on our behalf, the binding constraint stops being capability and becomes trust: whether an agent serves your interest, resists hijacking, and is who it claims to be. The winners will compete on verifiable trust primitives, not raw IQ.
Every model that ranks "what drives outcome Y" hands you a correlation, but you spend money on causes. The gap between the two is where data-driven companies quietly bleed, and more data makes it worse.
AI drug discovery keeps slipping because biology's labels are scarce, confounded, and often non-reproducible. You can't learn a reliable function from unreliable data; more compute just delivers the wrong answer faster.
From inside a working lab: agents compress every part of science where a check is fast and cheap, and stall wherever the answer is gated by a wet-lab experiment that takes weeks. Difficulty was never the dividing line.
A clinical AI that is right 95% of the time is more dangerous, in one specific way, than one right 70% of the time: high reliability switches off the human vigilance the whole safety case depends on, and deskilling means the backstop never forms.
Clinical AI's real future isn't a diagnosis-in-a-box. It's an agent that generates the full hypothesis space and proposes the cheapest discriminating test, while the physician stays the control layer that owns the priors and the cost of being wrong.
LLMs are confident, fluent pattern-matchers that will always produce a plausible answer, right or wrong. Medicine built a discipline for reasoning safely around exactly that kind of mind: the differential diagnosis.