The Automated Scientist Is a Category Error

Science is not hypothesis generation, which is cheap and always was. It is the disciplined killing of hypotheses against reality, plus the taste to pick which are worth testing — and neither is a text problem.

By Mehdi · 10 min read

The dream of an AI that autonomously does science rests on a mistake about what science is. The pitch — a system that reads the literature, forms hypotheses, and discovers truths while you sleep — treats discovery as a generation problem. It is not. Hypothesis generation is the cheap part, and it has always been cheap. Any competent graduate student can produce a dozen plausible mechanisms before lunch, and any drunk at a bar can produce more with less inhibition. Science is the disciplined, expensive killing of those hypotheses against reality, plus the judgment of which ones are even worth the cost of the test. A large language model is superb at the first act and structurally absent from the other two. Calling that an automated scientist is not optimism. It is a category error.

Let me be precise about the error, because the surface plausibility is strong and the demos are seductive. When you watch a model emit fifty candidate explanations for an anomalous result, each internally coherent and dressed in the right vocabulary, it feels like thought. It is thought's exhaust, not its engine. The engine is somewhere else, and it runs on a fuel these systems cannot synthesize.

The bottleneck was never the supply of ideas

Start with the arithmetic, because the quantitative shape is the whole argument. A research question has some space of candidate explanations. In a real problem — say, why a particular long non-coding RNA tracks with cellular senescence — that space is effectively unbounded. There are direct-regulation stories, sponge-for-microRNA stories, chromatin-scaffold stories, pure-artifact stories where the signal is a batch effect from the sequencing run, and stories where the correlation is real but downstream of something you never measured. A language model can enumerate these, and hundreds more, at near-zero marginal cost. That is genuinely useful, in the way a good brainstorming partner is useful.

Now count the other side of the ledger. To distinguish between those explanations, you need experiments that produce different outcomes depending on which story is true. Each such experiment has a cost measured in months, reagents, animal cohorts, and the opportunity cost of the bench it occupies. In my own corner of computational biology — epigenetic aging clocks, lncRNA function, causal inference on the microbiome — the ratio is brutal. You can generate a thousand hypotheses in an afternoon and test maybe three in a year, and two of those will die of ambiguity because the experiment you could afford did not actually discriminate between the live options.

So the pipeline was never throughput-limited at the generation stage. Flooding the top of a funnel whose constraint sits at the bottom does not increase output. It grows the pile of untested claims, which is worse than useless, because someone now has to triage the pile, and triage is itself the scarce skill. An idea machine with no contact with reality does not accelerate science. It manufactures a backlog and calls it progress.

This is the same structural point I have made about drug discovery, where the binding constraint isn't the model but the ground truth it's starved of: you can predict a million binding affinities, but the wet-lab assays that tell you which predictions are true come in at a trickle, and the model cannot manufacture its own validation. Generation is abundant. Adjudication against reality is scarce. Pouring more abundance onto a scarcity does not relieve it.

Falsification is a physical act, not a text operation

Popper's contribution was not the tidy slogan that theories must be falsifiable. It was the deeper observation that the content of a scientific claim is exactly the set of observations it forbids. A theory that forbids nothing says nothing. The work of science is to find the forbidden observation and go looking for it in the world — to design the one experiment whose result the hypothesis cannot survive if it is false.

Notice what that act requires, and where a language model stands in relation to each requirement.

First, you must design an experiment that could kill the idea. Not confirm it: kill it. This is a genuinely creative act, and here a model can in principle contribute, because proposing candidate designs is still a generation task. But the property that makes a falsifying experiment good is that its possible outcomes cleave the hypothesis space cleanly, and judging whether a design has that property requires knowing how the physical system actually behaves: which confounds are in play, what the assay's noise floor is, whether your perturbation has off-target effects that will muddy the readout. A model trained on text has read descriptions of these things. It has never been burned by one at three in the morning when the Western blot came back with a band at the wrong molecular weight.

Second, and no amount of scale touches this, you must run it against the world. The falsifying observation lives in the physical substrate: cells, tissue, a cohort of patients, an instrument with a calibration drift. The world is the only oracle that can say no in a way you did not anticipate. A language model has no channel to that oracle. It can predict what the world will probably say, drawing on the corpus of what the world has said before, but a prediction of the oracle's answer is not the oracle's answer. The entire epistemic value of an experiment is that it can surprise you, and it can only surprise you if it is causally connected to the thing you do not yet know. Text prediction is, by construction, a machine for producing the unsurprising — the high-probability continuation. That is the exact opposite of what a decisive experiment is for.

Take a concrete case from clinical reasoning, because medicine formalized this discipline long before machine learning existed. A differential diagnosis is not a list of diseases consistent with the symptoms — that list is the cheap, generative part, and a first-year medical student can produce it, and so can a chatbot. The skilled move is ordering the test that changes management: the one whose result reclassifies the patient across a treatment threshold. You start with a pre-test probability, you pick the investigation with a likelihood ratio large enough to move you across that threshold, and you deliberately avoid the tests that will come back "abnormal" in ways that are true but irrelevant and generate only follow-up noise. A physician who orders every possible test is not being thorough. He is being negligent, because he has confused enumeration with discrimination. The automated-scientist pitch is that same negligence, industrialized: enumerate everything, discriminate nothing, trust that volume is a virtue.
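The threshold logic in that paragraph is the odds form of Bayes' rule: post-test odds equal pre-test odds times the likelihood ratio. A minimal sketch follows, assuming a single binary test and a single treatment threshold. The function names and every number are hypothetical and chosen for illustration; none are clinical values.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' rule: post-test odds = pre-test odds * likelihood ratio."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


def changes_management(pre: float, lr_pos: float, lr_neg: float, threshold: float) -> bool:
    """True if at least one possible result moves the patient across the threshold."""
    treated_before = pre >= threshold
    after_pos = post_test_probability(pre, lr_pos) >= threshold
    after_neg = post_test_probability(pre, lr_neg) >= threshold
    return after_pos != treated_before or after_neg != treated_before


# Illustrative assumptions: 20% pre-test probability, treat at 50% or above.
PRE, THRESHOLD = 0.20, 0.50

# A discriminating test (LR+ 10, LR- 0.1) versus a weak one (LR+ 1.5, LR- 0.8).
print(changes_management(PRE, 10.0, 0.1, THRESHOLD))  # True
print(changes_management(PRE, 1.5, 0.8, THRESHOLD))   # False
```

With these made-up figures, a positive result on the first test lifts the probability from 0.20 to about 0.71 and crosses the threshold. No result on the second test can cross it, so ordering it adds cost and follow-up noise without changing the decision.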

A degenerate programme by construction

Popper alone is too clean; real science does not abandon a theory the moment one prediction fails, because any prediction rides on a stack of auxiliary assumptions, any of which might be the thing that broke. Lakatos gave us the honest version. Science proceeds as research programmes, each with a hard core of commitments and a protective belt of adjustable assumptions around it. What separates a good programme from a bad one is not whether it ever meets anomalies — all of them do — but the direction of its adjustments.

A programme is progressive when it predicts novel facts: risky, ahead-of-time predictions that then get corroborated, where the theory sticks its neck out and reality declines to cut it off. A programme is degenerate when it only ever explains, after the fact, what has already been observed — when every anomaly is absorbed by adding an epicycle to the protective belt, and the theory never forbids anything it did not already know was safe. Ptolemy could fit any planetary motion by adding more circles. That was exactly the problem.

Now hold an autonomous idea-machine against that criterion. It generates explanations. Given any observation, it will supply a coherent account of why that observation was to be expected. Given the opposite observation, it will supply an equally coherent account of that. Its explanatory belt is infinitely elastic, because generating post-hoc coherence is precisely what a language model is optimized to do. What it structurally cannot do, absent a channel to the world, is issue a risky novel prediction and let reality adjudicate. It has no way to be caught out, because it never commits to a forbidden observation and then goes and checks.

So the autonomous scientist is not a programme that might happen to degenerate. It is a degenerate programme by construction. It has a hard core of pure generation and a protective belt made entirely of more generation, with no mechanism anywhere in the loop for a corroborated novel prediction to enter. You can bolt on retrieval, you can let it write and run code, you can have it critique its own output with another instance of itself — but if the whole loop closes inside the text distribution, with the final arbiter being another model's judgment rather than an experiment, you have built a system whose sole output is post-hoc coherence at scale. That is the Ptolemaic move with a GPU. It will feel productive and it will forbid nothing.

Taste is the scarce input, and it does not reduce to a prompt

There is a second scarcity, quieter than falsification and just as fatal to the autonomy dream: the judgment of which hypotheses are worth the cost of testing. The experimental budget is finite and small. Most hypotheses, even true ones, are not worth confirming, because confirming them changes nothing you would do next. The scientist's real craft is picking the few tests that are simultaneously decisive, affordable, and consequential — where the answer, either way, redirects the field or the treatment or the product.

This is scientific taste, and it is not soft or mystical. It is a learned probability distribution over which questions carry high expected information per unit cost, conditioned on a model of what the field already believes, what would shock it, and what would actually matter if true. It is built over years of being wrong in expensive ways, which is precisely the training signal a system with no exposure to experimental cost cannot receive. You cannot acquire taste for experiments by reading papers about experiments, any more than you can learn to trade by reading about markets without ever being liquidated.
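"Expected information per unit cost" has a standard formalization for the simplest case: one binary hypothesis and one binary experimental outcome. The sketch below computes the expected reduction in entropy and divides by cost. It is an illustration of the quantity, not a procedure for acquiring taste: the names, the two candidate experiments, and every number are hypothetical, and the inputs (the prior, the outcome probabilities, the cost) are exactly the judgment calls the paragraph above says must be learned.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy in bits; zero at certainty (also guards float overshoot)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_info_gain(prior: float, p_pos_if_true: float, p_pos_if_false: float) -> float:
    """Prior entropy minus the expected posterior entropy over both outcomes."""
    p_pos = prior * p_pos_if_true + (1.0 - prior) * p_pos_if_false
    p_neg = 1.0 - p_pos
    gain = entropy(prior)
    if p_pos > 0.0:
        gain -= p_pos * entropy(prior * p_pos_if_true / p_pos)
    if p_neg > 0.0:
        gain -= p_neg * entropy(prior * (1.0 - p_pos_if_true) / p_neg)
    return gain


# Hypothetical candidates: (name, P(pos | H true), P(pos | H false), cost in months).
candidates = [
    ("decisive perturbation", 0.95, 0.05, 12.0),
    ("cheap correlational assay", 0.60, 0.50, 1.0),
]
PRIOR = 0.5  # assumed prior that the hypothesis is true

ranked = sorted(
    candidates,
    key=lambda c: expected_info_gain(PRIOR, c[1], c[2]) / c[3],
    reverse=True,
)
for name, p_true, p_false, cost in ranked:
    print(name, expected_info_gain(PRIOR, p_true, p_false) / cost)
```

With these assumed inputs the decisive experiment ranks first despite costing twelve times as much, because the cheap assay's outcomes barely depend on whether the hypothesis is true. An experiment whose outcome is equally likely either way has an expected gain of zero at any price.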

Taste resists specification for the same reason it is valuable. If you could write down the criteria for a good experiment completely enough to hand them to an optimizer, the judgment would already be commoditized and would no longer be the bottleneck. The criteria are contextual, partly tacit, and they shift with the frontier. This is the general pattern I have argued is the durable human skill: specifying the problem precisely enough to be worth solving outlasts the ability to prompt a model to solve it, because the specification is where the irreducible judgment lives, and the closer a task sits to raw generation the faster its value decays. In science, the specification is the choice of experiment, and that choice is the whole game.

Instruments, not scientists

None of this makes AI marginal to research. It makes it an instrument, and instruments have changed science more than most theories have. The microscope, the sequencer, the mass spectrometer — each collapsed the cost of a specific measurement and let human judgment reach where it could not before. That is the correct frame for these systems, and it is not a demotion. A telescope did not do astronomy. It let astronomers see.

This is already the shape of the leverage in my own work. An agent that reads ten thousand papers and surfaces the three whose methods bear on my confound is doing real work, but I supply the judgment of which confound matters. An agent that drafts an analysis pipeline saves me a week of code, but I decide what a positive control has to rule out before the result means anything. An agent that generates fifty hypotheses is useful exactly to the degree that I have the taste to discard forty-nine and the discipline to design the one experiment that could kill the fiftieth. The acceleration is real and it is large. It is the acceleration of a scientist wielding a sharper instrument, not the replacement of the scientist by the instrument.

The failure I expect — and this is a forecast — is a wave of systems marketed as autonomous scientists that produce enormous volumes of plausible, coherent, untested claims, evaluated by other models against benchmarks rather than against the world, and celebrated for a throughput that measures the wrong thing. They will look productive in exactly the way a degenerate programme looks productive: always explaining, never forbidding, never once caught out by a reality they did not already contain. The people who extract real value will be the scientists who treat these systems as instruments and keep the two scarce jobs — killing hypotheses against the world, and choosing which are worth killing — firmly in human hands, because those jobs were never text problems and no fluency will make them so.

The world remains the only reviewer that can say no and mean it. Build the machine that helps you ask it better questions. Do not mistake the machine for the one who has to live with the answer.

Frequently asked questions

Doesn't AlphaFold prove AI can do science autonomously?

AlphaFold is a spectacular instrument, not an autonomous scientist. It solves a well-posed prediction problem (sequence to structure) trained on decades of human-curated crystallography and cryo-EM ground truth, and its outputs are still validated experimentally before anyone builds on them. It compressed a specific, expensive measurement, which is exactly the instrument role I argue for. It did not decide which biological questions were worth asking, design falsifying experiments, or generate its own ground truth.

Aren't there already 'AI scientist' systems that write papers end to end?

There are pipelines that generate hypotheses, write code, run simulations, and draft manuscripts. What they mostly automate is the cheap and the plausible: idea generation and prose. The papers they produce are typically evaluated by other language models or on benchmarks, not by risky predictions surviving contact with a physical world that can say no. That is the degenerate-programme failure mode Lakatos warned about, dressed in automation.

So is AI useless for accelerating research?

The opposite. The leverage is real, but it lives in the instrument role: reading a literature no human can hold in working memory, proposing experiments a scientist then filters through taste, running the mechanical parts of a workflow, and compressing specific measurements. The scientist still supplies the falsification discipline and decides what is worth the cost of a decisive test. Agents make good scientists faster; they do not remove the need for one.

Filed under Cross-Disciplinary Deep Essays. Where biology, computation, markets, and philosophy collide.
