The scaling hypothesis — that model loss falls predictably as you add parameters, data, and compute — is the most successful empirical regularity in the recent history of machine learning. It is also an explanation of nothing. A curve that has fit the last several orders of magnitude is a description of what happened, not an account of why, and the two are not interchangeable. The capital structure of frontier AI rests on the assumption that a line nobody can explain will keep going straight. That is a defensible bet. It is not a theorem, and the entire strategic question is whether the people allocating the capital know the difference.
I want to argue this precisely, because both camps argue it badly. The boosters treat the scaling curve as a law of nature and reason forward from it as if from Newton. The skeptics point at each temporary plateau and declare the wall. Both make the same mistake in opposite directions: they read a mechanism off a trend line that contains no mechanism. What the trend line actually licenses is narrower, stranger, and more useful to understand than either tribe admits.
Kepler had the curve; Newton had the reason
Johannes Kepler spent years buried in Tycho Brahe's observations and extracted three laws of planetary motion. Orbits are ellipses. A planet sweeps equal areas in equal times. The square of the orbital period is proportional to the cube of the semi-major axis. Correct, predictive, quantitatively exact — and completely silent about why. Kepler had fit the data. He had no idea what generated it.
Newton supplied the generator: an inverse-square law of gravitation from which Kepler's three laws fall out as theorems. In doing so he did something curve-fitting cannot do at any level of precision. He told you the boundary conditions. He told you the two-body ellipse is an idealization, that a third mass perturbs it, that the regularity is a limiting case of something deeper. This is why, when Mercury's perihelion was later found to precess about 43 arcseconds per century faster than Newtonian perturbations could account for, the anomaly was legible. There was a mechanism precise enough to be violated, and the violation pointed at general relativity. A mechanism is what lets you see the edge of a regularity before you drive off it.
Scaling laws are Kepler without Newton. We have an extraordinary curve, loss falling as a clean power law across many orders of magnitude of compute. We do not have the because. Without the because, we have no principled way to say where the curve ends — which is precisely the thing a company betting fifty billion dollars on the next order of magnitude most needs to know.
What the curve actually says, and the arithmetic hiding inside it
Look at the functional form. The empirical scaling laws describe test loss as roughly
L(N) = E + A · N^(−α)
where N is model size (the same shape holds for data and compute), and the fitted exponents α are small: in the original published estimates, on the order of 0.05 to 0.1. Two features of this equation matter and are almost never stated plainly to the people writing the checks.
First, a power law with a tiny exponent is a punishing master. Take α ≈ 0.076, near the original published value for parameter scaling. To halve the reducible part of the loss, L − E, you multiply N by 2^(1/α) ≈ 2^13.2, roughly nine thousand. Nine thousand times the parameters to cut the remaining reducible loss in half, and nine thousand times again for the next halving. The curve is smooth and it is real, and it is also an escalator into a wall of cost. Every halving costs the same enormous multiple, so the required input grows exponentially in the number of halvings. That is not a wall in the sense of stopping. It is a wall in the sense of economics.
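The halving arithmetic is easy to check by hand or in a few lines. The sketch below assumes only the functional form and the exponent quoted above; the prefactor A and the starting size N are hypothetical placeholders, and the function name is mine.

```python
def scale_factor_to_halve(alpha: float) -> float:
    """Multiplier on N that halves the reducible loss A * N**(-alpha)."""
    return 2.0 ** (1.0 / alpha)


alpha = 0.076  # exponent quoted in the text
factor = scale_factor_to_halve(alpha)
print(f"parameters per halving: x{factor:,.0f}")

# Sanity check with hypothetical values of A and N: multiplying N by
# `factor` should cut the reducible term A * N**(-alpha) exactly in half.
A, N = 1.0, 1e9
before = A * N ** (-alpha)
after = A * (N * factor) ** (-alpha)
ratio = after / before
print(f"reducible loss ratio after scaling: {ratio:.3f}")

# Cost of several successive halvings, as a multiple of the starting size.
for halvings in (1, 2, 3):
    print(f"{halvings} halving(s): x{factor ** halvings:.3g} parameters")
```

The result does not depend on A or on the starting N, which is the point: the price of a halving is set by α alone.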
Second, and more fundamental: the term E is an asymptote. It is the irreducible loss, the intrinsic entropy of the text, the part no amount of modeling can predict because it is genuinely unpredictable given the context. The power law bakes its own ceiling into its own equation. As compute goes to infinity, loss does not go to zero; it goes to E. The very curve people extrapolate optimistically is, on its face, a description of asymptotic diminishing returns toward a floor we cannot compute in advance. The optimism is not in the equation. It lives in a belief about what happens to capabilities as loss creeps toward E, and that belief is a separate, unexamined thing.
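A minimal numerical sketch of that floor, assuming the functional form above. The values of E and A here are hypothetical, chosen only to make the shape visible; they are not fitted to any real model.

```python
def loss(n_params: float, E: float, A: float, alpha: float) -> float:
    """L(N) = E + A * N**(-alpha), the form discussed in the text."""
    return E + A * n_params ** (-alpha)


# Hypothetical constants for illustration only; alpha is the quoted exponent.
E, A, alpha = 1.7, 10.0, 0.076

sizes = [10.0 ** k for k in range(6, 19, 3)]  # 1e6, 1e9, 1e12, 1e15, 1e18
losses = [loss(n, E, A, alpha) for n in sizes]

for n, value in zip(sizes, losses):
    print(f"N = {n:.0e}   loss = {value:.3f}   reducible = {value - E:.3f}")

# Improvement bought by each successive 1000x increase in N.
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
print("gain per 1000x step:", [round(g, 3) for g in gains])
```

Each thousandfold step buys a smaller improvement than the one before it, and the loss never crosses E no matter how far the list of sizes is extended.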
The map from loss to capability is unexplained and not monotone
Here is the move everything downstream depends on and that no one has justified: the identification of falling loss with rising intelligence. Scaling laws are about cross-entropy loss on next-token prediction. They are not about reasoning, planning, truthfulness, or any of the capabilities anyone actually cares about. The claim that lower loss reliably buys those capabilities is not part of the scaling law. It is an empirical hope layered on top of it.
And the map from loss to capability is demonstrably not clean. The debate over "emergent abilities" made this concrete: several capabilities that appeared to switch on suddenly at scale turn out, under smoother evaluation metrics, to improve gradually — the emergence was partly an artifact of how we chose to score. Cut the other way, the Inverse Scaling Prize documented real tasks where larger models get reliably worse, where more scale moves the wrong direction. So the loss curve can glide smoothly downward while the capability you care about is flat, jumping, or regressing, depending on the task and the metric. You cannot read capability off loss. The one quantity scaling laws actually predict is not the quantity the industry is selling.
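The scoring artifact can be shown with a toy model. The sketch below assumes an answer is marked correct only if every one of its k tokens is right, and that token errors are independent; both are simplifying assumptions, and the per-token accuracies are hypothetical numbers, not measurements.

```python
def exact_match(per_token_acc: float, answer_len: int) -> float:
    """Probability that all tokens are right, assuming independent errors."""
    return per_token_acc ** answer_len


# Hypothetical per-token accuracies at successive model scales.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
k = 10  # hypothetical answer length in tokens

sequence = [exact_match(p, k) for p in per_token]

for p, s in zip(per_token, sequence):
    print(f"per-token accuracy = {p:.2f}   exact-match = {s:.3f}")
```

The underlying skill improves steadily across the whole range, yet the all-or-nothing score sits near zero for most of it and then climbs steeply. The same system looks gradual or abrupt depending on which column you report.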
This is the same error I keep flagging in a different domain: a predictive artifact mistaken for the thing itself, a case I argue at length in "Prediction Is Not Understanding." A model can drive its loss down by becoming a better next-token predictor without acquiring the underlying structure we impute to it, exactly as a regression can nail the fit without capturing the cause. Low loss is a prediction. Intelligence is an explanation. Conflating them is not a technicality. It is the load-bearing assumption of a trillion-dollar sector.
Interpolation is not extrapolation, and the data is finite
Two structural headwinds sharpen the point, both about the difference between filling in a distribution and going beyond it.
A model trained on a vast corpus learns the manifold of that corpus. Inside it the system interpolates, often with startling fluency, and much of what reads as reasoning is very high-dimensional interpolation across an enormous training distribution. That is genuinely powerful, not to be sneered at. But interpolation and extrapolation are different operations with different guarantees. Performance inside the training distribution tells you little about performance off it, and "off it" is exactly where novel scientific reasoning, genuinely new market conditions, and true out-of-distribution generalization live. The scaling curve is measured on held-out samples drawn from the same distribution as the training data. It is silent, by construction, about the extrapolative regime, and the extrapolative regime is where the returns the industry has promised would have to come from.
Then the blunt physical constraint: high-quality human text is a finite stock. The compute-optimal recipe, Chinchilla, says that to use compute well you scale training tokens roughly in step with parameters, which makes the frontier ravenous for data at precisely the moment the supply of high-quality public text is coming into view of exhaustion. Synthetic data is the proposed escape, but training a model on its own outputs risks distribution collapse, the model narrowing toward its own prior — a well-documented failure mode, not a free lunch. When one of the three axes you were scaling along runs out of road, the joint extrapolation you drew is no longer the curve you fit.
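To see how quickly the data appetite grows, here is a rough sketch under two assumptions that are not in the text above: the common approximation that training cost is about 6 · N · D floating-point operations for N parameters and D tokens, and the frequently quoted rule of thumb of roughly 20 training tokens per parameter. Both are approximations, and the compute budgets are hypothetical.

```python
import math


def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget into (parameters, tokens).

    Assumes flops ~= 6 * N * D and D = tokens_per_param * N.
    Both are rules of thumb, not exact laws.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Hypothetical compute budgets, each 100x the previous one.
for flops in (1e21, 1e23, 1e25):
    n, d = compute_optimal(flops)
    print(f"C = {flops:.0e} FLOPs   N = {n:.2e} params   D = {d:.2e} tokens")
```

Under these assumptions every hundredfold increase in compute asks for ten times the parameters and ten times the tokens, so the token requirement marches upward with the budget while the stock of high-quality text does not.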
The graveyard of smooth trends
Now the deepest problem, and the one I would put money on being underweighted. You cannot distinguish an exponential from the early phase of a sigmoid using finite data. Every logistic curve, every S-curve that eventually saturates, is in its early stretch indistinguishable from pure exponential growth. The bacterial culture growing in fresh medium looks exactly like unbounded growth right up until it hits the carrying capacity it was always heading toward. The inflection is invisible from inside the exponential phase. This is not a vibe. It is a fact about curve-fitting with bounded observations, and it is why the history of technology is a graveyard of trends that were smooth until they weren't.
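The claim is easy to verify numerically. The sketch below compares a pure exponential with a logistic curve that shares its starting value and growth rate; the starting value, rate, and carrying capacity K are hypothetical, picked only for illustration.

```python
import math


def exponential(t: float, x0: float, r: float) -> float:
    """Unbounded growth: x0 * e^(r t)."""
    return x0 * math.exp(r * t)


def logistic(t: float, x0: float, r: float, K: float) -> float:
    """Logistic growth from x0 at rate r, saturating at carrying capacity K."""
    return K / (1.0 + ((K - x0) / x0) * math.exp(-r * t))


# Hypothetical parameters: start at 1, unit growth rate, ceiling at one million.
x0, r, K = 1.0, 1.0, 1e6


def relative_gap(t: float) -> float:
    """Fraction by which the exponential exceeds the logistic at time t."""
    e = exponential(t, x0, r)
    return (e - logistic(t, x0, r, K)) / e


for t in (0, 2, 5, 8, 14, 20):
    print(f"t = {t:2d}   exp = {exponential(t, x0, r):.3e}   "
          f"logistic = {logistic(t, x0, r, K):.3e}   gap = {relative_gap(t):.2%}")
```

For the early rows the two curves differ by far less than any realistic measurement noise; by the last row the exponential has overshot the ceiling by orders of magnitude. An observer who only has the early rows has no way, from the data alone, to tell which curve they are on.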
Moore's law held for decades; then Dennard scaling broke around the mid-2000s, clock speeds flatlined, and the trend that felt like a law of physics turned out to be a temporary regime resting on physical assumptions that expired. Malthus extrapolated a population curve without a model of the demographic transition that bent it. Extrapolating a growth curve without a model of what generates the growth is the central error: a line fit to the past is not a law, it is a description awaiting a mechanism, and only the mechanism tells you which curve you are actually on. Absent a theory of why loss scales as it does, we cannot know whether we are looking at an exponential or the pretty early half of an S, and the two imply wildly different futures from identical present data.
The honest steelman, and why it doesn't rescue the theorem
The strongest argument on the other side is real, and I will state it at full strength: the scaling skeptics have been wrong, repeatedly, embarrassingly, for years. Every confident declaration of an imminent wall has so far been overrun. Capabilities that looked impossibly far off arrived on schedule with the compute. That track record is genuine evidence, and anyone who waves it away is not being serious.
Notice exactly what kind of evidence it is. A regularity that has held through k successive scale-ups raises your rational credence that it holds at scale-up k+1. It does not, and cannot, entail it. This is Hume's problem of induction with a GPU cluster attached: no finite sequence of confirmations converts an empirical trend into a necessary truth. The corroboration is worth a great deal. I would bet with the trend over any specific near-term wall, because that is where the evidence points. What the corroboration cannot buy is the boundary conditions. A longer track record makes the bet better. It does not turn the bet into a theorem, because the thing that turns a trend into a theorem is a mechanism, and more data points are not a mechanism.
So hold both halves at once, because both are true. Scaling is the best bet on the board. And it is a bet on an unexplained empirical trend whose stopping conditions are unknown to the people making it.
What the error costs
The failure mode is not "scaling stops." It is the mispricing of a bet that has been laundered into a certainty. When you treat an unexplained curve as a law, you allocate as if the boundary conditions don't exist. You build the org, the capex schedule, and the entire strategic narrative on a single scaling axis, and you leave yourself no instrumentation to detect the inflection until you are past it. A mechanistic theory would give you leading indicators of where the regularity thins out. The pure empiricist has only lagging ones. He finds the edge of the curve by driving off it.
This is where biology matters, and not as metaphor. Complex adaptive systems almost never improve along a single scalar axis without limit; they hit regime changes, phase transitions, and saturating returns dictated by structure the aggregate curve conceals. Treating "more compute" as a lone dial that monotonically produces "more intelligence" is exactly the one-variable thinking I call out in "Why Most AI Strategy Is Biologically Illiterate": the reflex to model a multi-dimensional, mechanism-rich system as a smooth curve in one number, because the smooth curve is the part you can see. The map is not the system. The loss is not the mind.
Scaling has bought us something genuinely astonishing, and I expect it to buy more. But an industry cannot keep confusing a curve it can draw with a law it can prove. Kepler was right about the ellipses for nearly eighty years before anyone knew why. The difference is that Kepler never had to raise the next round against the claim that he did.