
Prediction Is Not Understanding: The Ceiling LLMs Inherit From Statistics

LLMs model the correlational structure of their training data with astonishing fidelity, but correlation is not causation and fluency is not truth. Knowing where that ceiling sits tells you what to trust them for and what the next paradigm must add.

By Mehdi · 8 min read

A large language model is the most sophisticated statistical model ever built, and it inherits the deepest limitation of statistical models along with their power. It learns the correlational structure of its training distribution — which tokens follow which, in what contexts, with what probability — and it captures that structure with a fidelity that genuinely deserves the word extraordinary. But a joint distribution over text is not a theory of the world, and the difference is not philosophical hair-splitting. It surfaces in three specific, predictable places: causal reasoning, generalization outside the training distribution, and knowing whether a fluent sentence is actually true. Understand why those three, and you know both what to trust an LLM for and what the next paradigm has to add.

This is not the "LLMs are just autocomplete" dismissal. That line is lazy and wrong; the internal representations these models learn are rich, compositional, and useful in ways autocomplete never was. The claim is more precise and more durable: there is a ceiling, it comes from the mathematics of learning-from-observation, and no amount of scale removes it because scale improves the wrong quantity.

Three levels, and the model lives on the first

Judea Pearl's framing is the cleanest tool here, and it does real work rather than decoration. Reasoning about the world happens at three levels, and they are not interchangeable.

The first level is association: what is the probability of Y given that I observe X? This is P(Y | X) — the joint distribution, conditional probabilities, correlation. Everything statistics classically does lives here, and everything an LLM learns from next-token prediction lives here too. The model estimates, over an enormous vocabulary and context window, the conditional distribution of the next token given everything before it. It does this exceptionally well. Association is a real and powerful level; most useful prediction is association.
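A minimal sketch of what "estimating the conditional distribution of the next token" means, using bigram counts over a made-up corpus. A real model conditions on the whole context with a neural network rather than a lookup table; the corpus and the function name here are illustrative assumptions, not anything from a production system.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
corpus = "the drug helps the patient . the drug harms the liver .".split()

def conditional_next_token(tokens):
    """Estimate P(next | prev) from bigram counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize each row of counts into a conditional distribution.
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }

p_next = conditional_next_token(corpus)
print(p_next["drug"])  # {'helps': 0.5, 'harms': 0.5}
```

The table records that "helps" and "harms" each follow "drug" half the time in this corpus. Nothing in it says what the drug does; it is pure association.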

The second level is intervention: what is the probability of Y if I do X? Written properly, P(Y | do(X)), and the do-operator is not cosmetic notation. Observing that people who take a drug recover faster tells you P(recovery | observed to take drug). It does not tell you P(recovery | do(take drug)), because the people who took it may differ systematically from those who didn't — the sick-quitter effect, confounding by indication, a hundred named traps. The two quantities can point in opposite directions. To compute the interventional one you need something the joint distribution does not contain: a causal model, a directed structure that says which variables produce which. I have written separately about how this exact gap — reasoning from "we saw X and Y together" to "if we change X, Y will follow" — sits hidden inside almost every AI-driven business decision, and it is the same gap here, one level down in the stack.
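To make the gap concrete, here is a small worked example under an assumed structural model: disease severity Z influences both who takes the drug (X) and who recovers (Y). All probabilities are invented for illustration. Within each severity level the drug raises recovery by ten points, yet the observed comparison makes it look harmful because severe patients are far more likely to be treated.

```python
# Hypothetical structural model: Z -> X, and (Z, X) -> Y. All numbers are made up.
p_z = {0: 0.5, 1: 0.5}            # P(Z = z): 0 = mild, 1 = severe
p_x1_given_z = {0: 0.1, 1: 0.9}   # P(X = 1 | Z = z): severe patients get the drug
p_y1_given_xz = {                 # P(Y = 1 | X = x, Z = z), keyed by (x, z)
    (0, 0): 0.8, (1, 0): 0.9,
    (0, 1): 0.3, (1, 1): 0.4,
}

def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]

def observational(x):
    """P(Y=1 | X=x): conditioning on x reweights Z by P(z | x)."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    joint = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    return joint / p_x

def interventional(x):
    """P(Y=1 | do(X=x)): setting x by fiat leaves Z at its marginal P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

obs_effect = observational(1) - observational(0)
do_effect = interventional(1) - interventional(0)
print(f"observed difference:       {obs_effect:+.2f}")  # -0.30
print(f"interventional difference: {do_effect:+.2f}")   # +0.10
```

The `interventional` function is only computable because the code was handed the causal structure (Z is a common cause, and it is measured). The joint distribution over X and Y alone would not have supplied that.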

The third level is counterfactual: given that X happened and Y happened, what would Y have been had X been different? This is the level of explanation, blame, and scientific understanding, and it requires an even more complete model.

An LLM trained on observational text operates on the first level. It can recite content from the second and third — the internet is full of causal claims and counterfactual reasoning, and the model reproduces their surface form fluently. What it cannot do is derive interventional or counterfactual answers it has not effectively seen, because it never built the causal object those answers require. It has the shadow of causation that correlation casts on text, not causation itself.

Why more data cannot fix this

The reflex is to assume this is a data problem: feed it enough and the causal structure will emerge. The mathematics says otherwise, and the obstacle has a name — identifiability.

Here is the whole problem in one idea: the same observational distribution is consistent with many distinct causal structures. Suppose X and Y are correlated. That single fact is equally compatible with X causing Y, Y causing X, a hidden common cause Z driving both, or any mixture. Each of these can produce the identical joint distribution over X and Y. When they do, no quantity of samples from that distribution can tell them apart, because they are not different distributions to sample from; they are different mechanisms behind the same distribution. Observation alone cannot break the tie.
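A small sketch of the identifiability argument with two binary variables. Model A says X causes Y; Model B says Y causes X, with its parameters chosen to reproduce A's joint distribution. The numbers are hypothetical. The two models agree on every observable probability and disagree as soon as X is set by intervention.

```python
from itertools import product

def bern(p1, v):
    """P(V = v) for a binary variable with P(V = 1) = p1."""
    return p1 if v == 1 else 1 - p1

# Model A (hypothetical numbers): X causes Y.
pA_x1 = 0.3
pA_y1_given_x = {0: 0.2, 1: 0.7}
joint_A = {(x, y): bern(pA_x1, x) * bern(pA_y1_given_x[x], y)
           for x, y in product((0, 1), repeat=2)}

# Model B: Y causes X, parameterized to match A's joint exactly.
pB_y1 = joint_A[(0, 1)] + joint_A[(1, 1)]
pB_x1_given_y = {y: joint_A[(1, y)] / (joint_A[(0, y)] + joint_A[(1, y)])
                 for y in (0, 1)}
joint_B = {(x, y): bern(pB_y1, y) * bern(pB_x1_given_y[y], x)
           for x, y in product((0, 1), repeat=2)}

# Observationally indistinguishable: the joints agree cell by cell.
same_joint = all(abs(joint_A[k] - joint_B[k]) < 1e-12 for k in joint_A)

# Under do(X = 1) the mechanisms give different answers for P(Y = 1).
do_x1_under_A = pA_y1_given_x[1]  # Y responds to X
do_x1_under_B = pB_y1             # Y ignores X, so its marginal is unchanged

print(same_joint, do_x1_under_A, round(do_x1_under_B, 2))
```

Any sample drawn from `joint_A` is also a sample from `joint_B`, so collecting more of them cannot separate the two models. Only the intervention does.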

What breaks it is intervention — you do X and watch whether Y moves — or a structural assumption you import from outside the data. This is why randomized controlled trials exist. In my own work on aging biology and epigenetic clocks, the point is unavoidable and expensive. A methylation pattern correlates beautifully with chronological age; a model predicts age from a blood sample to within a few years. That is a triumph of the associational level, and it is genuinely useful. It also tells you nothing, by itself, about whether those methylation changes cause aging or merely track it. The clock predicts. It does not explain. To move from prediction to mechanism you need perturbation — knock the gene down, run the intervention, watch what actually shifts — and you need to fight confounders like batch effects, where the plate a sample was processed on leaves a signature that a naive model will happily learn as if it were biology. The reproducibility crisis across the life sciences is, in large part, a monument to what happens when correlational findings are read as causal ones. A model trained on the observational corpus inherits every one of those confounds silently. It has no perturbation channel. It cannot run the experiment.

So when someone says a sufficiently large model will "learn causality from text," the precise rebuttal is: text contains humans' causal claims, and the model can learn to reproduce those. That is inheriting our causal beliefs, correct and incorrect alike, not deriving causal structure from data. On any genuinely novel intervention — one nobody has written the answer to — the model is back to interpolating textual plausibility, and identifiability guarantees that observational text underdetermines the answer.

The grounding problem: fluent and false are indistinguishable from the inside

The second failure has a different root but the same shape. A system trained only on text has no independent contact with the world against which to check a claim. Its only signal is textual: does this sentence look like the kind of sentence that appears in the training distribution? That signal cannot, even in principle, separate a true statement from a fluent, well-formed false one — because falsity is not a textual property. "The femoral nerve innervates the quadriceps" and a plausibly-worded false claim about some nerve you'd have to open an anatomy atlas to refute are, to a text-only model, the same kind of object: grammatical, on-topic, distributionally normal. The model ranks them by plausibility, not by truth, because plausibility is the only variable it has.

This is exactly the hallucination failure mode, seen correctly. Hallucination is not a bug layered on top of an otherwise-truthful system that better engineering will patch out. It is the direct expression of what the system is: a generator of distributionally-plausible text with no truth-channel. When the plausible completion happens to be true, we call it knowledge. When it happens to be false, we call it hallucination. The model is doing the identical operation in both cases and cannot tell which it just did, because it has nothing to check against. Truth requires reference — some contact between the sentence and the state of affairs it describes. Pure text-learning severs exactly that link. This is the machine analogue of a clinician who has read every textbook and never touched a patient: fluent, confident, and with no independent purchase on whether the case in front of them matches the words.
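A toy illustration of ranking by plausibility, not a model of any real system. The corpus below is hypothetical and deliberately constructed so that a false anatomical claim is written more often than the true one. The scorer is a smoothed bigram model, a stand-in for "how much does this sentence look like the training text".

```python
import math
from collections import Counter

# Hypothetical toy corpus: the false claim appears more often than the true one.
corpus = [
    "the femoral nerve innervates the hamstrings",  # false
    "the femoral nerve innervates the hamstrings",  # false
    "the femoral nerve innervates the quadriceps",  # true
]

bigrams, contexts = Counter(), Counter()
for line in corpus:
    toks = ["<s>"] + line.split()
    contexts.update(toks[:-1])           # how often each token is a left context
    bigrams.update(zip(toks, toks[1:]))  # how often each adjacent pair occurs
vocab_size = len({t for line in corpus for t in line.split()} | {"<s>"})

def log_plausibility(sentence):
    """Add-one smoothed bigram log-probability. No input here encodes truth."""
    toks = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )

true_claim = "the femoral nerve innervates the quadriceps"
false_claim = "the femoral nerve innervates the hamstrings"
scores = {s: log_plausibility(s) for s in (true_claim, false_claim)}
preferred = max(scores, key=scores.get)
print(preferred)  # the false claim wins on plausibility
```

The labels "true" and "false" live only in the comments. The scoring function never sees them, which is the point: fixing the ranking requires a signal from outside the text.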

What the ceiling tells you to build

State the ceiling plainly and it becomes a design spec rather than a lament. Scaling makes the correlational model better — sharper conditional distributions, better calibration within the training distribution, more of the long tail memorized. It does not make the model causal, and it does not make it grounded, because those capabilities require information that observational text does not contain at any volume. That is the boundary. It also names the four directions that actually cross it, each addressing a specific missing piece rather than hoping scale will conjure it.

Causal representation learning attacks identifiability head-on: learn variables and structure that support the do-operator, using interventional data, temporal ordering, or explicit structural assumptions the raw text can't supply. World models give the system a learned simulator it can roll forward and query counterfactually, so "what happens if" becomes a forward pass through a model of dynamics rather than a lookup of what people wrote. Tool use and retrieval are the pragmatic grounding fix already in production: don't ask the model to know the current price, the row count, or the sum — have it call a system that touches reality and read the answer off that. A calculator, a code interpreter, a database query, or a search index attached to a model is not a convenience but an epistemic upgrade; it swaps textual plausibility for a source with actual reference. Embodiment closes the loop hardest: a system that acts and observes consequences has a perturbation channel — it can do X and see Y move, which is precisely the signal observational learning lacks.
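The tool-use idea can be sketched in a few lines. The `calculator` and `answer` functions below are hypothetical names, and `model_guess` stands in for whatever text a language model would generate; no real model or agent framework is being called. The sketch only shows the routing decision: when a question can be computed, the computed value replaces the plausible-sounding one.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression):
    """Hypothetical tool: evaluate +, -, *, / over numeric literals only."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def answer(question, model_guess):
    """Route arithmetic to the tool; otherwise fall back to the model's text."""
    try:
        return {"source": "tool", "value": calculator(question)}
    except (ValueError, SyntaxError):
        return {"source": "model", "value": model_guess}

result = answer("1234 * 5678", model_guess="about 7 million")
fallback = answer("Who wrote this?", model_guess="unknown")
print(result)    # {'source': 'tool', 'value': 7006652}
print(fallback)  # {'source': 'model', 'value': 'unknown'}
```

Tagging each answer with its source is a design choice worth keeping in any real version: it records which claims passed through something with reference and which are still bare generation.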

None of these is mystical, and none requires abandoning what LLMs are good at. They are additions that supply the missing level. The correct mental model of a frontier system is a superb associational engine wrapped in scaffolding that outsources the parts association cannot reach.

This is also the sharpest way to see why an idea-generator is not a scientist. A model that fluently proposes hypotheses is operating at the first level; it produces plausible causal-sounding sentences. Science is the second and third levels — designing the intervention that could kill the hypothesis, touching the world, updating on what came back. The generator can draft the conjecture but cannot, from inside its own distribution, perform the refutation that makes it knowledge, which is why treating the automated scientist as a finished loop is a category error rather than an engineering milestone. Prediction is upstream of understanding, and the distance between them is exactly the work that observation cannot do for you.

So trust the LLM where association is the right level: drafting, summarizing, translating, pattern completion, retrieval over things it has genuinely seen, generating candidates a grounded system will then check. Distrust it precisely where causation, novelty beyond the distribution, or bare factual truth are load-bearing and no tool is in the loop. The failures are not random; they cluster exactly at the boundary between the level the model occupies and the levels it only imitates. A system that predicts the world extraordinarily well and understands none of it is not a contradiction. It is the single most important thing to hold clearly in mind about the machines we are now building everything on top of.

Frequently asked questions

Does this mean scaling LLMs is a dead end?
No. Scaling reliably improves the correlational model: it sharpens the joint distribution over text and makes predictions better calibrated within distribution. What it does not do is convert a correlational model into a causal or grounded one, because those require information that is not present in observational text at any volume. Scaling raises the ceiling of one capability; it does not change which capability you have.

Isn't retrieval-augmented generation (RAG) already solving the grounding problem?
Partially, and that is exactly the point. Retrieval and tool use work by outsourcing grounding to a system that does touch reality: a database, a search index, a calculator, a code interpreter. The model stops asserting facts from textual plausibility and starts reading them off an external source of truth. That is the right architectural direction. But it only grounds the claims that pass through the tool; anything the model asserts on its own is still generated from the correlational model and subject to the same failure mode.

Can't a large enough model just learn causality from text, since text describes causal relationships?
Text contains causal claims, and a model can learn to reproduce them fluently. That is different from having a causal model you can query under intervention. The obstacle is identifiability: the same observational distribution is consistent with many distinct causal structures, so no volume of observational data uniquely picks out the mechanism. A model can memorize which causal sentences humans tend to write, but that is inheriting our causal claims, not deriving them, and it fails precisely on the novel interventions no one has written down.

Filed under Cross-Disciplinary Deep Essays. Where biology, computation, markets, and philosophy collide.
