
Scaling Is Not a Theory of Intelligence

The scaling hypothesis is the most successful empirical regularity in the history of machine learning and an explanation of nothing. The industry has bet its capital structure on a line it cannot explain continuing straight.

By Mehdi · 9 min read

The scaling hypothesis — that model loss falls predictably as you add parameters, data, and compute — is the most successful empirical regularity in the recent history of machine learning. It is also an explanation of nothing. A curve that has fit the last several orders of magnitude is a description of what happened, not an account of why, and the two are not interchangeable. The capital structure of frontier AI rests on the assumption that a line nobody can explain will keep going straight. That is a defensible bet. It is not a theorem, and the entire strategic question is whether the people allocating the capital know the difference.

I want to argue this precisely, because both camps argue it badly. The boosters treat the scaling curve as a law of nature and reason forward from it as if from Newton. The skeptics point at each temporary plateau and declare the wall. Both make the same mistake in opposite directions: they read a mechanism off a trend line that contains no mechanism. What the trend line actually licenses is narrower, stranger, and more useful to understand than either tribe admits.

Kepler had the curve; Newton had the reason

Johannes Kepler spent years buried in Tycho Brahe's observations and extracted three laws of planetary motion. Orbits are ellipses. A planet sweeps equal areas in equal times. The square of the orbital period is proportional to the cube of the semi-major axis. Correct, predictive, quantitatively exact — and completely silent about why. Kepler had fit the data. He had no idea what generated it.

Newton supplied the generator: an inverse-square law of gravitation from which Kepler's three laws fall out as theorems. In doing so he did something curve-fitting cannot do at any level of precision. He told you the boundary conditions. He told you the two-body ellipse is an idealization, that a third mass perturbs it, that the regularity is a limiting case of something deeper. This is why, when Mercury's perihelion was later found to precess by about 43 arcseconds per century more than Newtonian perturbation theory predicted, the anomaly was legible. There was a mechanism precise enough to be violated, and the violation pointed at general relativity. A mechanism is what lets you see the edge of a regularity before you drive off it.

Scaling laws are Kepler without Newton. We have an extraordinary curve, loss falling as a clean power law across many orders of magnitude of compute. We do not have the because. Without the because, we have no principled way to say where the curve ends — which is precisely the thing a company betting fifty billion dollars on the next order of magnitude most needs to know.

What the curve actually says, and the arithmetic hiding inside it

Look at the functional form. The empirical scaling laws describe test loss as roughly

L(N) = E + A · N^(−α)

where N is model size (the same shape holds for data and compute), and the fitted exponents α are small — in the published estimates, on the order of 0.05 to 0.1. Two features of this equation matter and are almost never stated plainly to the people writing the checks.

First, a power law with a tiny exponent is a punishing master. Take α ≈ 0.076, near the original published value for parameter scaling. To halve the reducible part of the loss, L − E, you multiply N by 2^(1/α) = 2^13.2, roughly nine thousand. Nine thousand times the parameters to cut the remaining reducible loss in half, and nine thousand times again for the next halving. The curve is smooth and it is real, and it is also an escalator into a wall of cost. Each further halving costs the same multiplicative factor, so the total input grows exponentially in the number of halvings you want. That is not a wall in the sense of stopping. It is a wall in the sense of economics.
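The halving arithmetic is easy to check directly. What follows is a minimal sketch, assuming the reducible loss follows A · N^(−α) exactly; the function names, the constant A = 1, and the starting size of one million parameters are illustrative choices, not values from any published fit.

```python
def scale_factor_to_halve(alpha: float) -> float:
    """Factor by which N must grow to halve the reducible loss A * N**(-alpha)."""
    return 2.0 ** (1.0 / alpha)


def reducible_loss(n: float, a: float, alpha: float) -> float:
    """Reducible part of the loss, L - E, under the power-law form."""
    return a * n ** (-alpha)


alpha = 0.076  # exponent quoted in the text for parameter scaling
factor = scale_factor_to_halve(alpha)
print(f"multiply N by about {factor:,.0f} per halving")

# Cumulative multiplier on N after k successive halvings of the reducible loss.
for k in range(1, 4):
    print(f"{k} halving(s): N grows by a factor of about {factor ** k:.3g}")

# Sanity check: scaling N by `factor` really does halve the reducible loss.
n0 = 1e6  # hypothetical starting model size
ratio = reducible_loss(n0 * factor, 1.0, alpha) / reducible_loss(n0, 1.0, alpha)
print(f"loss ratio after one scale-up: {ratio:.3f}")
```

The per-halving factor comes out a little over nine thousand, and the loop makes the compounding visible: the second halving needs that factor squared, the third needs it cubed.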

Second, and more fundamental: the term E is an asymptote. It is the irreducible loss, the intrinsic entropy of the text, the part no amount of modeling can predict because it is genuinely unpredictable given the context. The power law bakes its own ceiling into its own equation. As compute goes to infinity, loss does not go to zero; it goes to E. The very curve people extrapolate optimistically is, on its face, a description of asymptotic diminishing returns toward a floor we cannot compute in advance. The optimism is not in the equation. It lives in a belief about what happens to capabilities as loss creeps toward E, and that belief is a separate, unexamined thing.

The map from loss to capability is unexplained and not monotone

Here is the move everything downstream depends on and that no one has justified: the identification of falling loss with rising intelligence. Scaling laws are about cross-entropy loss on next-token prediction. They are not about reasoning, planning, truthfulness, or any of the capabilities anyone actually cares about. The claim that lower loss reliably buys those capabilities is not part of the scaling law. It is an empirical hope layered on top of it.

And the map from loss to capability is demonstrably not clean. The debate over "emergent abilities" made this concrete: several capabilities that appeared to switch on suddenly at scale turn out, under smoother evaluation metrics, to improve gradually — the emergence was partly an artifact of how we chose to score. Cut the other way, the Inverse Scaling Prize documented real tasks where larger models get reliably worse, where more scale moves the wrong direction. So the loss curve can glide smoothly downward while the capability you care about is flat, jumping, or regressing, depending on the task and the metric. You cannot read capability off loss. The one quantity scaling laws actually predict is not the quantity the industry is selling.
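One way to see how a scoring choice can manufacture a jump is a toy calculation. This is a sketch under a strong simplifying assumption of my own, that a model gets each token of an answer right independently with the same probability; the answer length of 30 tokens and the accuracy values are hypothetical. Per-token accuracy is the smooth metric, and exact match over the whole answer is the all-or-nothing one.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length


answer_length = 30  # hypothetical answer length in tokens

# Per-token accuracy rises gradually; the all-or-nothing score does not.
for p in (0.80, 0.90, 0.95, 0.98, 0.99):
    score = exact_match_rate(p, answer_length)
    print(f"per-token accuracy {p:.2f} -> exact-match rate {score:.3f}")
```

Under these assumptions the underlying quantity climbs steadily from 0.80 to 0.99 while the exact-match score sits near zero for most of the range and then rises steeply. The model did not change character; the metric did.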

This is the same error I keep flagging in a different domain, a predictive artifact mistaken for the thing itself, which I argue at length in Prediction Is Not Understanding. A model can drive its loss down by becoming a better next-token predictor without acquiring the underlying structure we impute to it, exactly as a regression can nail the fit without capturing the cause. Low loss is a prediction. Intelligence is an explanation. Conflating them is not a technicality. It is the load-bearing assumption of a trillion-dollar sector.

Interpolation is not extrapolation, and the data is finite

Two structural headwinds sharpen the point, both about the difference between filling in a distribution and going beyond it.

A model trained on a vast corpus learns the manifold of that corpus. Inside it the system interpolates, often with startling fluency, and much of what reads as reasoning is very high-dimensional interpolation across an enormous training distribution — genuinely powerful, not to be sneered at. But interpolation and extrapolation are different operations with different guarantees. Performance inside the training distribution tells you little about performance off it, and "off it" is exactly where novel scientific reasoning, genuinely new market conditions, and true out-of-distribution generalization live. The scaling curve is measured by interpolating within a held-out sample of the same distribution. It is silent, by construction, about the extrapolative regime — and the extrapolative regime is where the returns the industry has promised would have to come from.

Then the blunt physical constraint: high-quality human text is a finite stock. The compute-optimal recipe from the Chinchilla work says that to use compute well you scale training tokens roughly in step with parameters, which makes the frontier ravenous for data at precisely the moment the supply of high-quality public text is coming into view of exhaustion. Synthetic data is the proposed escape, but training a model on its own outputs risks model collapse, the model narrowing toward its own prior. That is a well-documented failure mode, not a free lunch. When one of the three axes you were scaling along runs out of road, the joint extrapolation you drew is no longer the curve you fit.
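To put rough numbers on that appetite, here is a back-of-envelope sketch. It leans on two assumptions that are not from this essay: the common approximation that training cost is about 6 FLOPs per parameter per token, and the rule of thumb of roughly 20 training tokens per parameter often quoted from the Chinchilla results. Both constants and the compute budgets are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ≈ 6 * N * D training-cost approximation
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb compute-optimal ratio


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) satisfying C = 6*N*D with D = 20*N."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Hypothetical compute budgets, each 100x the last.
for budget in (1e22, 1e24, 1e26):
    n, d = compute_optimal(budget)
    print(f"C = {budget:.0e} FLOPs -> N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Under these assumptions the token requirement grows with the square root of compute, so every hundredfold increase in budget asks for ten times as much text. The stock of high-quality text does not grow on that schedule.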

Now the deepest problem, and the one I would put money on being underweighted. You cannot distinguish an exponential from the early phase of a sigmoid using finite, noisy data drawn from that early phase. Every logistic curve, every S-curve that eventually saturates, is in its early stretch indistinguishable from pure exponential growth to within any realistic measurement error. The bacterial culture growing in fresh medium looks exactly like unbounded growth right up until it nears the carrying capacity it was always heading toward. The inflection is invisible from inside the exponential phase. This is not a vibe. It is a fact about curve-fitting with bounded observations, and it is why the history of technology is a graveyard of trends that were smooth until they weren't.
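The claim can be made concrete with two curves that share a starting value and an early growth rate. This is a sketch with hypothetical parameters (starting value 1, rate 1, carrying capacity one million); the function names are mine.

```python
import math


def exponential(t: float, x0: float, rate: float) -> float:
    """Unbounded exponential growth."""
    return x0 * math.exp(rate * t)


def logistic(t: float, x0: float, rate: float, capacity: float) -> float:
    """Logistic growth from x0 that saturates at `capacity`."""
    return capacity / (1.0 + (capacity / x0 - 1.0) * math.exp(-rate * t))


def relative_gap(t: float, x0: float = 1.0, rate: float = 1.0,
                 capacity: float = 1e6) -> float:
    """Fraction by which the exponential overshoots the logistic at time t."""
    e = exponential(t, x0, rate)
    return (e - logistic(t, x0, rate, capacity)) / e


for t in (2, 5, 8, 10, 12, 14):
    print(f"t = {t:>2}: relative gap = {relative_gap(t):.4%}")
```

With these parameters the inflection sits near t = ln(capacity / x0 − 1), about 13.8. For most of the run-up the gap between the two curves is a small fraction of a percent, far below the noise in any real measurement, and it only becomes large in the last stretch before the inflection.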

Moore's law held for decades; then Dennard scaling broke around the mid-2000s, clock speeds flatlined, and the trend that felt like a law of physics turned out to be a temporary regime resting on physical assumptions that expired. Malthus extrapolated a population curve without a model of the demographic transition that bent it. Extrapolating a growth curve without a model of what generates the growth is the central error: a line fit to the past is not a law, it is a description awaiting a mechanism, and only the mechanism tells you which curve you are actually on. Absent a theory of why loss scales as it does, we cannot know whether we are looking at an exponential or the pretty early half of an S, and the two imply wildly different futures from identical present data.

The honest steelman, and why it doesn't rescue the theorem

The strongest argument on the other side is real, and I will state it at full strength: the scaling skeptics have been wrong, repeatedly, embarrassingly, for years. Every confident declaration of an imminent wall has so far been overrun. Capabilities that looked impossibly far off arrived on schedule with the compute. That track record is genuine evidence, and anyone who waves it away is not being serious.

Notice exactly what kind of evidence it is. A regularity that has held through N successive scale-ups raises your rational credence that it holds at N+1. It does not, and cannot, entail it. This is Hume's problem of induction with a GPU cluster attached: no finite sequence of confirmations converts an empirical trend into a necessary truth. The corroboration is worth a great deal. I would bet with the trend over any specific near-term wall, because that is where the evidence points. What the corroboration cannot buy is the boundary conditions. A longer track record makes the bet better. It does not turn the bet into a theorem, because the thing that turns a trend into a theorem is a mechanism, and more data points are not a mechanism.
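A toy model shows the shape of that gap. The sketch below uses Laplace's rule of succession, which is my choice of illustration and not a model of how scaling actually works: it assumes each scale-up is an independent trial with an unknown success probability and a uniform prior over that probability.

```python
from fractions import Fraction


def rule_of_succession(successes: int) -> Fraction:
    """P(next trial succeeds) after `successes` straight successes, uniform prior."""
    return Fraction(successes + 1, successes + 2)


# Credence that the trend survives the next scale-up, after an unbroken run.
for run_length in (0, 1, 3, 8, 98):
    p = rule_of_succession(run_length)
    print(f"after {run_length:>2} straight successes: P(next holds) = {p} ≈ {float(p):.3f}")
```

Under those assumptions the credence climbs with every confirmation and never reaches 1. The track record improves the odds; it does not supply the missing certainty.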

So hold both halves at once, because both are true. Scaling is the best bet on the board. And it is a bet on an unexplained empirical trend whose stopping conditions are unknown to the people making it.

What the error costs

The failure mode is not "scaling stops." It is the mispricing of a bet that has been laundered into a certainty. When you treat an unexplained curve as a law, you allocate as if the boundary conditions don't exist. You build the org, the capex schedule, and the entire strategic narrative on a single scaling axis, and you leave yourself no instrumentation to detect the inflection until you are past it. A mechanistic theory would give you leading indicators of where the regularity thins out. The pure empiricist has only lagging ones. He finds the edge of the curve by driving off it.

This is where biology matters, and not as metaphor. Complex adaptive systems almost never improve along a single scalar axis without limit; they hit regime changes, phase transitions, and saturating returns dictated by structure the aggregate curve conceals. Treating "more compute" as a lone dial that monotonically produces "more intelligence" is exactly the one-variable thinking I call out in Why Most AI Strategy Is Biologically Illiterate: the reflex to model a multi-dimensional, mechanism-rich system as a smooth curve in one number, because the smooth curve is the part you can see. The map is not the system. The loss is not the mind.

Scaling has bought us something genuinely astonishing, and I expect it to buy more. But an industry cannot keep confusing a curve it can draw with a law it can prove. Kepler was right about the ellipses for nearly eighty years before anyone knew why. The difference is that Kepler never had to raise the next round against the claim that he did.

Frequently asked questions

Is this an argument that scaling will stop working?
No. The claim is narrower and more durable: we have an empirical curve without a mechanism, so we cannot state the conditions under which it holds or breaks. Betting the curve continues is reasonable. Treating its continuation as guaranteed is a category error, because a description of the past is not a law with boundary conditions.

Haven't scaling skeptics been wrong repeatedly?
Yes, and that is the strongest counterargument, which the essay takes seriously. Every confident prediction of a near-term wall has so far been beaten, and that is genuine Bayesian evidence for the trend. It is not evidence for a theorem, because a regularity that has held N times tells you nothing deductive about time N+1. That is Hume's problem of induction, and it is exactly why a mechanism, not a longer track record, is what would close the gap.

What would a real theory of scaling look like?
Something that plays Newton to the current Kepler: an account of why loss falls as a power law in compute that also tells you where the regularity ends, meaning which data distributions, which capabilities, and which architectures fall outside it. A mechanism is valuable precisely because it lets you see the edge of a regularity before you reach it, the way general relativity told you in advance where Newtonian gravity would fail.

Filed under Cross-Disciplinary Deep Essays. Where biology, computation, markets, and philosophy collide.


Cross-Disciplinary Deep Essays

The One MAMMAL Result That Ran in a Wet Lab

MAMMAL posts state-of-the-art on nine benchmarks, but the result that matters is four potency predictions on drugs it never saw, confirmed by a real assay. Here's why that one experiment outweighs the leaderboard.

8 min read

Why Most AI Strategy Is Biologically Illiterate

Companies deploy AI like installing software. The right model is introducing an organism into an ecosystem, and selection pressure predicts the failure modes the ROI math can't see.

10 min read

Prediction Is Not Understanding: The Ceiling LLMs Inherit From Statistics

LLMs model the correlational structure of their training data with astonishing fidelity, but correlation is not causation and fluency is not truth. Knowing where that ceiling sits tells you what to trust them for and what the next paradigm must add.

8 min read