A model can only be as good as the function it's asked to approximate, and in drug discovery that function is being estimated from labels that are scarce, noisy, confounded, and — for a disturbing fraction of the literature — non-reproducible. The models are not the constraint. GPT-scale architectures, graph neural networks over molecular structure, transformers over sequence: these are extraordinary function approximators, and they have largely stopped being the rate-limiting step. What limits them is that we are training them to predict biological outcomes we cannot yet measure reliably. Feed a powerful learner unreliable labels and you do not get a weak model. You get a confident, precise, wrong one, and you get it faster with more compute.
This is why the "AI cures disease" timeline keeps sliding a year to the right every year. Not because the algorithms disappointed, but because the ground truth was never as solid as the demos implied. I want to walk through the mechanism concretely, from inside the data, because the failure is specific and it is not a failure of AI. It's a failure of measurement wearing a machine-learning costume.
What "ground truth" actually is in biology
In a well-posed learning problem you have inputs x, labels y, and a stable, if noisy, relationship y = f(x) + ε, where ε is roughly zero-mean random noise whose influence averages out as you collect more data. Image classification works because a cat is a cat regardless of who labeled it or which machine stored the image, and because you can get ten million labels cheaply.
Now consider a representative comp-bio dataset of the kind that trains a "predict which compound modulates this pathway" model. You have, say, 80 tumor samples and 20,000 measured genes, sometimes hundreds of thousands of features once you include methylation sites or transcript isoforms. The feature count exceeds the sample count by two or three orders of magnitude. In that regime almost any label can be perfectly separated by some combination of features. The question is never "can the model fit the data." It always can. The question is whether what it fit is biology or bookkeeping.
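The claim that the model "always can" fit is easy to check on pure noise. Below is a minimal sketch, assuming Gaussian noise features, coin-flip labels, and a plain perceptron standing in for whatever learner you prefer. The sizes are scaled down from the 80 by 20,000 case to keep it fast, and every name in it is hypothetical.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features and coin-flip labels: there is nothing real to learn."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.choice([-1, 1]) for _ in range(n_samples)]
    return X, y

def score(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def fit_perceptron(X, y, max_epochs=100):
    """Classic perceptron updates; stops once a full pass makes no mistakes."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * score(w, xi) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, X, y):
    hits = sum((score(w, xi) > 0) == (yi > 0) for xi, yi in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
X_train, y_train = make_noise_data(40, 2000, rng)   # far more features than samples
X_test, y_test = make_noise_data(200, 2000, rng)    # fresh noise, fresh coin flips

w = fit_perceptron(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
test_acc = accuracy(w, X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The training labels are random, yet the fit on them is perfect, while held-out accuracy sits near a coin flip. A perfect training fit in this regime tells you nothing about whether biology was learned.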
Usually it is partly bookkeeping. And the single most corrosive form of bookkeeping has a name.
Batch effects: when the signal is the calendar
Here is the failure I have watched break more analyses than any other. You run your disease samples in the lab one week and your controls the next, because that is how samples arrive. The reagent lot changes. The ambient temperature around the sequencer drifts. A different technician loads the plates. When you then look for genes that separate disease from control, you find hundreds, and a large share of them are separating Tuesday from the following Tuesday, not sick from healthy.
This is not a hypothetical. It is the default. In one now-infamous class of genomics results, a substantial fraction of the "signal" distinguishing groups was traceable to the date samples were processed. If your cases and controls are confounded with run date, the day of the experiment is a feature that predicts the label extremely well, and it is a feature that will never generalize to a new patient because new patients are not run on your Tuesday.
A model has no way to know this. To the optimizer, "expression pattern that correlates with disease" and "expression pattern that correlates with processing batch" are the same thing: a predictive feature with low loss. The learner will happily seize the batch signal, because it is often cleaner than the biological one. Instrument drift is more consistent than a disease that presents fifty different ways. So the more capacity you give the model, the more precisely it learns the artifact. Scale actively hurts here. It sharpens the wrong function.
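To make the optimizer's indifference concrete, here is a toy simulation. It is a sketch under invented assumptions: one gene with a weak disease signal, one gene with a strong batch signal, training data where batch and disease are perfectly confounded, and a one-feature threshold rule as the "model". The effect sizes and names are illustrative, not taken from any real dataset.

```python
import random

def simulate(n, confounded, rng):
    """Return (bio_gene, batch_gene) rows and 0/1 disease labels."""
    rows, labels = [], []
    for _ in range(n):
        disease = rng.random() < 0.5
        # Confounded design: every case is processed in batch 1, every control in batch 0.
        batch = disease if confounded else (rng.random() < 0.5)
        bio_gene = rng.gauss(0.5 if disease else 0.0, 1.0)   # weak, noisy biology
        batch_gene = rng.gauss(3.0 if batch else 0.0, 0.5)   # strong, clean artifact
        rows.append((bio_gene, batch_gene))
        labels.append(int(disease))
    return rows, labels

def stump_accuracy(rows, labels, feature, threshold):
    hits = sum(int(r[feature] > threshold) == lab for r, lab in zip(rows, labels))
    return hits / len(labels)

def fit_stump(rows, labels):
    """Pick the single feature whose class-mean midpoint best separates the labels.
    Assumes cases have the higher mean, which holds in this simulation."""
    best = None
    for feature in range(len(rows[0])):
        cases = [r[feature] for r, lab in zip(rows, labels) if lab == 1]
        controls = [r[feature] for r, lab in zip(rows, labels) if lab == 0]
        threshold = (sum(cases) / len(cases) + sum(controls) / len(controls)) / 2
        acc = stump_accuracy(rows, labels, feature, threshold)
        if best is None or acc > best[2]:
            best = (feature, threshold, acc)
    return best

rng = random.Random(1)
train_rows, train_labels = simulate(200, confounded=True, rng=rng)
new_rows, new_labels = simulate(2000, confounded=False, rng=rng)  # new patients, unrelated batches

feature, threshold, train_acc = fit_stump(train_rows, train_labels)
new_acc = stump_accuracy(new_rows, new_labels, feature, threshold)
print(f"chosen feature index: {feature} (0 = biology, 1 = batch)")
print(f"training accuracy: {train_acc:.2f}, accuracy on new patients: {new_acc:.2f}")
```

The rule picks the batch gene because it has the lower training loss, then falls to chance once batch stops tracking disease.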
The standard defense is batch correction: you statistically regress out the known batch variable. But you can only correct for confounders you measured and named. The uncontrolled ones stay baked in, and worse, if the batch is perfectly confounded with the biology (all cases on one machine, all controls on another), no correction can separate them, because the information required to do so was never collected. That is a study-design failure that no amount of downstream cleverness recovers. The bias was fixed at the wet bench, before a single line of code ran.
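The perfectly confounded case can be shown with the simplest possible correction, subtracting each batch's mean, which is what regressing out a batch indicator amounts to. This is a sketch under assumed effect sizes for a single simulated gene, not any particular batch-correction tool, and the function names are hypothetical.

```python
import random
from statistics import mean

def simulate_gene(design, rng, n=200, bio_effect=1.0, batch_effect=3.0):
    """Return (value, disease, batch) triples for one gene."""
    samples = []
    for i in range(n):
        disease = i % 2
        # "confounded": batch equals disease status.
        # "balanced": each batch holds equal numbers of cases and controls.
        batch = disease if design == "confounded" else (i // 2) % 2
        value = bio_effect * disease + batch_effect * batch + rng.gauss(0.0, 1.0)
        samples.append((value, disease, batch))
    return samples

def center_by_batch(samples):
    """Subtract each batch's mean from its samples."""
    batch_ids = {batch for _, _, batch in samples}
    batch_means = {b: mean(v for v, _, batch in samples if batch == b) for b in batch_ids}
    return [(v - batch_means[batch], disease, batch) for v, disease, batch in samples]

def case_control_gap(samples):
    cases = mean(v for v, disease, _ in samples if disease == 1)
    controls = mean(v for v, disease, _ in samples if disease == 0)
    return cases - controls

rng = random.Random(2)
results = {}
for design in ("balanced", "confounded"):
    raw = simulate_gene(design, rng)
    results[design] = (case_control_gap(raw), case_control_gap(center_by_batch(raw)))
    print(f"{design}: raw gap {results[design][0]:.2f}, corrected gap {results[design][1]:.2f}")
```

In the balanced design the corrected gap stays near the simulated biological effect. In the confounded design the correction removes the biology together with the artifact and leaves a gap of zero, because the two were never separable in the data.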
This is a causal-identification problem in disguise
Strip away the biology and the structure is familiar. The model observes a correlation between molecular features and an outcome, and we want it to tell us what will happen when we intervene, when we give a patient the drug. But observational correlation and interventional effect are different quantities, and they coincide only when there are no open backdoor paths through unmeasured confounders. Batch effect is a backdoor path. So is disease severity when it changes how samples are collected (sicker patients get sampled differently), and so is every clinical covariate that steered who ended up in which group.
A predictive model, however large, estimates P(outcome | features observed). Drug discovery needs P(outcome | features set by intervention). These are the same number only under identification conditions that biological data almost never satisfies. This is the exact trap I've written about in the causal-inference problem hiding inside every AI business decision: the algorithm optimizes association flawlessly and delivers a beautifully calibrated answer to a question no one wanted asked. In a churn model that costs you a wasted retention budget. In a drug program it costs you a hundred million dollars and eight years to discover, in a Phase II readout, that the correlation was never causal.
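The gap between the two quantities shows up in a few lines of simulation. The sketch below assumes a linear toy world with one unmeasured confounder u that drives both the feature and the outcome, and a true causal effect of exactly zero. A regression slope stands in for the predictive model, and all names and coefficients are made up for illustration.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def simulate(n, intervene, rng, causal_effect=0.0):
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unmeasured confounder, e.g. disease severity
        if intervene:
            x = rng.gauss(0.0, 1.0)        # feature set by the experimenter, independent of u
        else:
            x = u + rng.gauss(0.0, 0.5)    # feature observed passively, driven by u
        y = causal_effect * x + 2.0 * u + rng.gauss(0.0, 0.5)
        xs.append(x)
        ys.append(y)
    return xs, ys

rng = random.Random(3)
observational_slope = ols_slope(*simulate(5000, intervene=False, rng=rng))
interventional_slope = ols_slope(*simulate(5000, intervene=True, rng=rng))
print(f"observational slope:  {observational_slope:.2f}")
print(f"interventional slope: {interventional_slope:.2f}")
```

The observational fit reports a strong, stable association even though changing x does nothing to y. Only the run where x is set independently of u recovers the zero effect.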
More compute does not close an identification gap. Identification is a property of how the data was generated, not of how hard you fit it. You cannot optimize your way out of a confound; you can only design or intervene your way out, upstream.
Small samples, and the selection filter on top
Layer two more problems onto the confounding, because they compound.
First, sample sizes in biology are genuinely tiny relative to the complexity of what's being modeled. That is not because researchers are lazy. It is because each label can cost thousands of dollars and months of wet-lab work, and because sick humans are not a resource you can scale like scraped web text. With 80 samples and 20,000 features, the variance of any effect estimate is enormous. Small, high-variance samples are exactly the setting where extreme, unreplicable results appear most readily: the noisiest experiments throw off the most spectacular-looking effects, and those are the ones that clear a significance threshold and get published. The apparent effect you selected on was inflated by the same noise that made it visible, which means the true effect is systematically smaller than the one you trained on. That is regression to the mean eating your growth numbers, transposed from a marketing dashboard to a preclinical screen: the winners you selected are winners partly because they got lucky, and luck does not replicate.
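The inflation from selecting on significance can be simulated directly. This sketch assumes many identical two-arm studies of one small true effect, ten samples per arm, and a rough cutoff of 2 on the standardized difference as the "publishable" filter. Those numbers are assumptions chosen for illustration.

```python
import random
from statistics import mean, variance

def run_study(true_effect, n_per_arm, rng):
    """One small two-arm experiment; returns the estimated effect and its z-like statistic."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    diff = mean(treated) - mean(control)
    std_err = ((variance(treated) + variance(control)) / n_per_arm) ** 0.5
    return diff, diff / std_err

rng = random.Random(4)
true_effect = 0.2
studies = [run_study(true_effect, n_per_arm=10, rng=rng) for _ in range(5000)]

all_estimates = [diff for diff, _ in studies]
# Keep only the studies that clear the (assumed) significance cutoff.
published = [diff for diff, z in studies if z > 2.0]

mean_all = mean(all_estimates)
mean_published = mean(published)
print(f"true effect:                    {true_effect:.2f}")
print(f"mean estimate, all studies:     {mean_all:.2f}")
print(f"mean estimate, 'significant':   {mean_published:.2f} ({len(published)} studies)")
```

Across all studies the estimates average out to the true effect. Among the ones that cleared the cutoff, the average is several times larger, which is the inflated number a downstream model would be trained on.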
Second, publication bias curates the training corpus before you ever touch it. The literature — which is where a large model's priors and a scientist's hypotheses both come from — is a selected sample of positive results. Negative findings sit in file drawers. So the effect sizes reported in papers are, on average, overestimates, and the set of "known" drug-target relationships is enriched for false positives that happened to reach significance in an underpowered study. Train on the published record and you inherit a distribution that is optimistically skewed by construction.
Now the reproducibility crisis stops being a vague worry and becomes an engineering spec. When large-scale replication efforts have gone back to check landmark preclinical findings, a substantial share — in some cancer-biology replication programs, more than half — failed to reproduce the original effect. Sit with what that means for a training set. A meaningful fraction of your labels are not merely noisy around a true value; they are pointing at effects that do not exist. That is not ε averaging to zero. That is a corrupted f. You are asking the model to learn a function whose values were, in places, made up by chance and selection.
Why the honest wins prove the point
The strongest objection is protein structure prediction, and it deserves a straight answer, because it does not weaken the argument. It is the control experiment that confirms it.
Structure prediction worked spectacularly for reasons that are entirely about ground truth. A protein's folded structure is very nearly a deterministic function of its amino-acid sequence: same sequence, same physics, same fold, largely independent of which lab, which day, which technician. The label is stable. And the labels are abundant and standardized: decades of X-ray crystallography and cryo-EM deposited into a shared, curated repository of hundreds of thousands of experimentally solved structures, produced by mature methods whose error modes are understood and cross-validated. Clean function, low structured noise, large n, a real held-out test. Under those conditions a large model is exactly the right tool and it delivered a genuine, historic result.
That is the whole thesis stated in the positive. Where biology hands us clean, abundant, causally interpretable ground truth, AI is transformative right now. Where it hands us scarce, confounded, selection-filtered, half-reproducible labels (most of disease biology, most of efficacy and toxicity prediction, most of what "curing disease" actually requires), the model faithfully learns the mess. The variable that moved was never the algorithm. It was the data-generating process behind it.
This also predicts where AI drug discovery will keep posting real wins versus press releases. Problems close to physics and chemistry, with stable labels and large curated datasets — binding-pose geometry, some ADMET properties, molecular-property prediction where high-throughput assays are standardized — will progress fast. Problems that route through whole-organism biology, where the label is a stochastic clinical outcome mediated by dozens of confounded systems, will keep slipping, and no architecture change will rescue them.
The fix is a wet lab, not a training run
If the constraint is label quality, the remedy is not the thing the field is optimized to deliver. It is not a bigger model, a longer context, or another hundred thousand GPUs. It is better ground truth, and better ground truth is slow, expensive, unglamorous experimental work:
- Standardized generation, so that "which lab, which day" stops being a predictive feature: randomized processing order, balanced batches, samples blocked so cases and controls are never confounded with a run.
- Adversarial controls, designed to break a spurious signal before a model can exploit it: negative controls that should show nothing, positive controls with known effect sizes, and held-out data generated by a different site on different instruments.
- Causal, not merely observational, data: perturbation experiments that intervene on the system (knock the gene down, add the compound, measure the response) rather than passively observing correlations, because intervention is the only thing that identifies the interventional quantity we actually care about.
- Pre-registration and mandatory replication, so the training corpus stops being a selected sample of lucky positives.
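The first item on the list is the one that can be partly expressed in code. Here is a minimal sketch of a blocked, randomized run plan, assuming you know each sample's group before processing. The function name, sample IDs, and batch count are hypothetical, and a real study would also block on other covariates.

```python
import random
from collections import Counter

def assign_balanced_batches(sample_ids, labels, n_batches, seed=0):
    """Shuffle within each group, deal samples round-robin across batches,
    then shuffle the processing order inside each batch."""
    rng = random.Random(seed)
    by_label = {}
    for sample_id, label in zip(sample_ids, labels):
        by_label.setdefault(label, []).append(sample_id)

    batches = [[] for _ in range(n_batches)]
    offset = 0  # continue dealing where the previous group stopped, to even out batch sizes
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        for i, sample_id in enumerate(group):
            batches[(i + offset) % n_batches].append((sample_id, label))
        offset += len(group)

    for batch in batches:
        rng.shuffle(batch)  # randomized order within the run
    return batches

sample_ids = [f"S{i:03d}" for i in range(80)]
labels = ["case"] * 40 + ["control"] * 40
batches = assign_balanced_batches(sample_ids, labels, n_batches=4)

for number, batch in enumerate(batches, start=1):
    counts = Counter(label for _, label in batch)
    print(f"batch {number}: {dict(counts)}")
```

With cases and controls spread evenly across every run, run date carries no information about the label, so there is no batch signal for a model to seize.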
Every item on that list is measured in years and reagent budgets, not epochs. That is the uncomfortable part for a field whose center of gravity, and whose capital, sits on the compute side. The marginal dollar in AI drug discovery is far better spent generating one clean, causally interpretable dataset than fine-tuning one more model on the dirty data we already have. The bottleneck is not intelligence. It is measurement, and measurement does not have a scaling law.
Give the model reliable labels and it will find the drug. We just haven't finished doing the experiment yet.