Most of the growth spikes your team celebrates and most of the slumps it panics over were not caused by anything your team did. They are regression to the mean: the boring statistical fact that an extreme measurement tends to be followed by a less extreme one, for no reason other than that extremes are partly luck and luck does not repeat. Treat that gravity as signal — credit the tactic that "caused" the record month, blame the one that "caused" the bad week — and you will systematically reward noise and punish sense. This is not a soft cognitive-bias warning. It is arithmetic, and it is quietly corrupting the numbers you steer the company by.
I learned to fear this in a setting with far higher stakes than a dashboard: biological-aging research, where the same math can manufacture a rejuvenation effect out of thin air and get it published.
Extremeness selects for luck, and luck doesn't persist
Here is the mechanism with no hand-waving. Any measurement you take is a mix of a stable underlying value and transient noise: measurement error, mood, a random good week, whatever the local equivalent is. When you select a unit because its measurement is extreme, you are not just selecting for a high true value. You are also selecting for the cases where the noise happened to point the same direction. The true value is sticky. The noise is not. Measure again, and the true part stays roughly where it was while the lucky noise washes out, so the number drifts back toward the average. Nothing intervened. The drift is guaranteed by the selection itself.
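A minimal simulation makes the mechanism concrete. This is a sketch under stated assumptions, not a model of any real metric: each unit has a stable true value plus independent Gaussian noise on every measurement, and nothing at all happens between the two measurements. The function name, parameters, and defaults are all illustrative.

```python
import random
import statistics


def simulate_regression(n_units=10_000, true_sd=1.0, noise_sd=1.0,
                        top_fraction=0.1, seed=0):
    """Select the top slice on a first noisy measurement, then re-measure.

    Assumption (illustrative): measurement = stable true value + independent
    Gaussian noise, with no intervention between the two measurements.
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, true_sd) for _ in range(n_units)]
    first = [t + rng.gauss(0.0, noise_sd) for t in true_values]
    second = [t + rng.gauss(0.0, noise_sd) for t in true_values]

    # Select units whose FIRST measurement lands in the top slice.
    cutoff = sorted(first)[int(n_units * (1 - top_fraction))]
    selected = [i for i in range(n_units) if first[i] >= cutoff]

    return {
        "first_mean": statistics.fmean([first[i] for i in selected]),
        "second_mean": statistics.fmean([second[i] for i in selected]),
        "true_mean": statistics.fmean([true_values[i] for i in selected]),
    }


result = simulate_regression()
print(result)
```

With these defaults the selected group's second measurement should land well below its first and close to its true average, even though the code changes nothing about the units between measurements. Setting `noise_sd=0.0` removes the drift entirely.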
The strength of the effect scales with how much of your metric is noise. A perfectly reliable measure shows no regression at all. A pure-noise measure regresses fully to the population mean, because today's extreme predicts nothing about tomorrow. Every real business metric — weekly signups, a store's conversion, a rep's close rate, a cohort's retention — sits between those poles, and the noisier and shorter the window, the harder it snaps back.
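The two poles can be written as one formula. This is the classical test-theory sketch, and it rests on assumptions the essay does not state: each measurement is a true value $T$ plus independent noise $e$ of equal variance, and the expected follow-up is linear in the first measurement (which holds, for example, when both are jointly normal). The symbols $\mu$ (population mean) and $r$ (reliability) are introduced here for illustration.

```latex
\begin{align*}
X_1 &= T + e_1, \qquad X_2 = T + e_2, \\
r &= \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(e)}, \\
\mathbb{E}[X_2 \mid X_1 = x] &= \mu + r\,(x - \mu).
\end{align*}
```

At $r = 1$ the expected follow-up equals $x$, so there is no regression. At $r = 0$ it equals $\mu$, so regression is complete. As a worked case with made-up numbers: if $\mu = 100$, $r = 0.5$, and a unit measures $x = 140$, the expected follow-up is $100 + 0.5 \times 40 = 120$, with no intervention involved.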
In my world the metric is biological age. You estimate it from DNA methylation patterns using an epigenetic clock, and the estimate carries real technical and biological noise. Now imagine a study that enrolls people whose measured biological age sits far above their chronological age (the worst agers) and gives them an intervention. Measure them again months later and a chunk of them will look younger. Not because the intervention worked. Because you selected them at their noisy extreme, and the noise regressed. The extremeness of the baseline guarantees, on average, a less extreme follow-up, and that improvement wears the costume of a treatment effect. Without a control group selected the same way, the fake effect and a real one are indistinguishable in the data. I have watched competent people nearly ship exactly this conclusion. The only thing that stopped it was someone insisting on the boring question: would these people have "improved" anyway?
Hold onto that question. It is the whole game.
The same math is in your growth deck
Nothing about the above is biological. It is a property of selecting on extremes, and your operating cadence selects on extremes constantly.
The quarter after the record month. You had your best month ever. Part of that was genuine: the product got better, the market grew. But the reason this month cleared the previous record is disproportionately the transient stuff that all lined up at once — a viral post, a competitor's outage, a big deal that closed on the 30th instead of the 2nd. Those don't repeat on schedule. Next month regresses, and now there's a narrative meeting about what "broke." Usually nothing broke. You are watching noise you mistook for a new baseline return to the actual baseline.
The cohort you intervened on because it was worst. This is the aging study in disguise, and it is probably the most expensive version. You look across regions, or segments, or reps, and you act on the worst performer: a new playbook for the bottom store, a rescue campaign for the worst-retaining cohort. It improves. Of course it does. You chose it because it was at its unlucky extreme, and unlucky extremes regress upward on their own. The intervention collects the credit that gravity earned. Then you roll the playbook out everywhere, it does nothing to the median, and no one connects the disappointment back to the fact that the original "win" was never real.
The "our fix worked" story after a bad week. A week bad enough to trigger a fix is almost always partly bad luck. Ship a fix on Monday and the following week is better, because the bad luck ended and not necessarily because the fix did anything. The tighter your reaction loop, the more of these phantom wins you accumulate, and the more confident the team becomes in tactics that have never actually been tested against the counterfactual of doing nothing.
At Kommerce we live at the sharp end of this, because the whole product sits in trust-scarce, cash-on-delivery markets where the core metrics (delivery success, confirmation rates, return rates) are volatile by nature. A single unreliable courier or one regional holiday moves a merchant's weekly numbers enough to look like a trend. Early on it is genuinely tempting to attribute every good week to the last thing you shipped. The markets are noisy enough that regression to the mean isn't a footnote here: noise accounts for most of the week-to-week variance, and a team that hasn't internalized that will spend its life chasing its own tail.
Attribution is a causal claim, and you keep making it by accident
The through-line: every time you say "X drove the number," you have made a causal claim, and made it without the machinery causality actually requires. "We did X, then the metric moved" establishes correlation and temporal order. It does not touch the counterfactual (what the metric would have done without X), and the counterfactual is the entire content of the word "caused." Regression to the mean is dangerous precisely because it can produce a convincing correlation and a clean before/after story even when the true causal effect is zero. It is a machine for generating convincing evidence of things that aren't there.
I've argued the general version of this elsewhere: nearly every AI and analytics decision that looks like a prediction problem is really a causal-inference problem wearing a prediction costume, and confusing the two is how sophisticated teams end up optimizing correlations that evaporate on contact with reality. Regression to the mean is the cleanest, most common special case, the one where the confound is nothing exotic — just the selection you performed on yourself.
There's a diagnostic reflex worth borrowing from clinical medicine here. A good physician doesn't accept the first explanation that fits the symptom; they force themselves to enumerate what else would produce the same presentation before committing. The discipline that separates a real diagnosis from a plausible story is the same one that separates a real signal from a confident hallucination, human or machine. "The intervention worked" and "the extreme baseline regressed" produce identical charts. If your process can't tell them apart, your process cannot tell you what caused your growth.
What operators can actually do
You will never eliminate regression to the mean. It is a property of noisy measurement, and all measurement is noisy. What you can do is stop being fooled by it, cheaply.
Compare against a control selected the same way. This is the highest-leverage move and the one people skip because it feels slow. When you intervene on the worst cohort, hold back a randomly chosen slice of similarly bad cohorts and do nothing to them. If both groups improve by the same amount, your intervention did nothing and you just saved yourself from scaling a placebo. If the treated group improves clearly more than the held-back group, beyond what ordinary noise between the two would produce, that gap is real and it's yours. The control has to share the selection criterion; comparing your rescued worst store to an average store re-imports the exact bias you're trying to remove.
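The comparison can be sketched in a few lines. Everything here is hypothetical: the cohorts are simulated, the scores are in arbitrary units, and `rescue_experiment` is an illustrative name. The worst cohorts are picked on a noisy baseline, split at random into treated and held-back halves, and the effect is estimated as the treated change minus the control change.

```python
import random
import statistics


def rescue_experiment(n_cohorts=5000, true_effect=0.0, noise_sd=1.0,
                      worst_fraction=0.2, seed=1):
    """Compare a naive before/after lift with a same-selection control.

    Assumption (illustrative): score = stable true level + Gaussian noise,
    and the intervention adds `true_effect` to treated cohorts only.
    """
    rng = random.Random(seed)
    true_level = [rng.gauss(0.0, 1.0) for _ in range(n_cohorts)]
    before = [t + rng.gauss(0.0, noise_sd) for t in true_level]

    # Select the worst cohorts on the noisy baseline, then split at random.
    ranked = sorted(range(n_cohorts), key=lambda i: before[i])
    worst = ranked[: int(n_cohorts * worst_fraction)]
    rng.shuffle(worst)
    half = len(worst) // 2
    treated, control = worst[:half], worst[half:]

    def change(i, effect):
        after = true_level[i] + effect + rng.gauss(0.0, noise_sd)
        return after - before[i]

    treated_change = statistics.fmean([change(i, true_effect) for i in treated])
    control_change = statistics.fmean([change(i, 0.0) for i in control])
    return {
        "naive_lift": treated_change,        # what a before/after chart shows
        "control_lift": control_change,      # what doing nothing produced
        "estimated_effect": treated_change - control_change,
    }


print(rescue_experiment(true_effect=0.0))
print(rescue_experiment(true_effect=1.0))
```

With `true_effect=0.0` the naive lift should still come out clearly positive while the estimated effect sits near zero, which is the placebo case. With `true_effect=1.0` the estimated effect should recover roughly that value. A real analysis would also want an uncertainty estimate on the gap, which this sketch leaves out.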
Write down the prediction before you look. Pre-registration sounds academic; in practice it's one Slack message. Before the intervention, state what you expect the metric to hit and why, and — this is the part that does the work — what you'd expect without the intervention. That kills hindsight attribution, where any outcome gets retrofitted to whatever you happened to ship. If you didn't predict it in advance, you don't get to claim you caused it in retrospect.
Look at the whole distribution, not the tail you're standing on. The record month and the worst cohort are both tail observations, and tails are where luck concentrates. Pull back to the full distribution over a longer horizon. A "spike" inside the normal spread of your weekly variance is not an event; it's a sample. Most of what gets escalated in a Monday standup is a draw from a distribution the team never bothered to characterize. Know your metric's ordinary noise band and most false alarms disqualify themselves.
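One crude way to characterize the ordinary noise band is the historical mean plus or minus a couple of standard deviations. The sketch below assumes a metric with no trend or seasonality, which many real metrics violate, and the weekly numbers are made up for illustration. The function names and the choice of `k=2.0` are assumptions, not a recommendation.

```python
import statistics


def noise_band(history, k=2.0):
    """Return (low, high) as mean +/- k sample standard deviations.

    Assumption (illustrative): history is roughly stationary, with no
    trend or seasonality to remove first.
    """
    center = statistics.fmean(history)
    spread = statistics.stdev(history)
    return center - k * spread, center + k * spread


def is_unusual(value, history, k=2.0):
    """True only if value falls outside the ordinary noise band."""
    low, high = noise_band(history, k)
    return value < low or value > high


# Hypothetical weekly signups from eight past weeks.
weekly_signups = [100, 110, 90, 105, 95, 102, 98, 108]

print(noise_band(weekly_signups))
print(is_unusual(112, weekly_signups))  # a "spike" that sits inside the band
print(is_unusual(130, weekly_signups))  # a value outside the band
```

The point of the design is the order of operations: the band is computed from history before anyone looks at this week's number, so the week is judged against the distribution and not against last week alone.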
Ask the counterfactual out loud: would this have gotten better anyway? Make it a required field. Before crediting any tactic, someone has to argue what the number would have done under nothing. If the honest answer is "it probably would have regressed up regardless," you have not learned that your tactic works. You've learned that you can't yet tell.
None of these require a data-science team or a stats degree. They require one habit: treating "we did X and it worked" as a hypothesis to be attacked, not a result to be celebrated. That is a cultural setting, not a technical one, and it is the difference between an organization that compounds real learning and one that accumulates confident folklore about what drives its own growth.
The uncomfortable part is that this discipline mostly takes wins away from you. The record month becomes "a good draw." The rescued cohort becomes "unproven." Half the tactics in the internal wiki become "never actually tested." That subtraction is the point. A number you can't yet explain is worth more than a cause you invented to explain it, because only one of them survives contact with next quarter.