The One MAMMAL Result That Ran in a Wet Lab

MAMMAL posts state-of-the-art on nine benchmarks, but the result that matters is four potency predictions on drugs it never saw, confirmed by a real assay. Here's why that one experiment outweighs the leaderboard.

By Mehdi · 8 min read

Of everything in the MAMMAL paper — nine state-of-the-art benchmark wins, a headline duel with AlphaFold3 — the result that actually counts is the smallest and least glamorous one. IBM Research and Technion took a model that had never seen four particular drugs, asked it to rank them by potency, and then ran the experiment. The model said Carfilzomib > Nintedanib > Infigratinib > Vemurafenib. The wet lab agreed, exactly. That single confirmed ranking is worth more than the entire leaderboard, and it is worth understanding precisely why.

The reason is not sentiment about "real biology." It is an asymmetry in cost. Benchmark state-of-the-art is cheap and endlessly gameable: you fine-tune on a task, you compare against whatever specialized models happened to publish, you clear a 1% relative-improvement bar, and you book the win. A held-out prediction that survives contact with a physical assay is expensive, slow, and rare. MAMMAL produced one. In a field where roughly 90% of drug candidates fail before regulatory approval, the ability to be right about a molecule you have never encountered is the only capability that eventually pays for itself.

What the experiment actually showed

Here is the setup, stated accurately, because the details are the whole argument.

The authors selected four drugs and held them out of the GDSC cancer-drug-response training data the model learned from. MAMMAL predicted their relative potency ordering: Carfilzomib most potent, then Nintedanib, then Infigratinib, then Vemurafenib least potent. They then ran the standard GDSC protocol in the lab: cell viability measured with CellTiter-Glo after 72 hours of drug incubation, with IC50 fit in Prism. The measured ranking on the tested cell lines matched the prediction exactly. Extended computationally across all 805 GDSC cell lines, the ordering held in about 90 to 95% of them, which suggests these particular potency gaps are largely cell-line-independent rather than a lucky artifact of one genetic background.
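To make "matched the prediction exactly" and "held in about 90 to 95% of them" concrete, here is a minimal Python sketch of how such a comparison could be scored. It is not the authors' analysis code. The function names and the toy IC50 values are hypothetical, invented purely for illustration, and lower IC50 is taken to mean higher potency.

```python
from itertools import combinations

# Predicted order from the essay, most potent first.
PREDICTED = ["Carfilzomib", "Nintedanib", "Infigratinib", "Vemurafenib"]


def rank_by_potency(ic50_by_drug):
    """Sort drug names from most to least potent (lowest IC50 first)."""
    return sorted(ic50_by_drug, key=ic50_by_drug.get)


def pairwise_agreement(predicted, observed):
    """Fraction of drug pairs that both rankings place in the same relative order."""
    position = {drug: i for i, drug in enumerate(observed)}
    pairs = list(combinations(predicted, 2))  # (a, b) with a predicted more potent
    concordant = sum(position[a] < position[b] for a, b in pairs)
    return concordant / len(pairs)


def fraction_preserving_order(predicted, ic50_by_cell_line):
    """Share of cell lines whose full ranking equals the predicted ranking."""
    hits = sum(
        rank_by_potency(ic50) == predicted for ic50 in ic50_by_cell_line.values()
    )
    return hits / len(ic50_by_cell_line)


# HYPOTHETICAL IC50 values (micromolar), invented for illustration only.
# They could stand for measured or model-predicted values per cell line.
toy_panel = {
    "line_A": {"Carfilzomib": 0.01, "Nintedanib": 1.2, "Infigratinib": 4.0, "Vemurafenib": 30.0},
    "line_B": {"Carfilzomib": 0.02, "Nintedanib": 2.5, "Infigratinib": 3.1, "Vemurafenib": 45.0},
    "line_C": {"Carfilzomib": 0.05, "Nintedanib": 6.0, "Infigratinib": 2.0, "Vemurafenib": 20.0},
}

exact_share = fraction_preserving_order(PREDICTED, toy_panel)
line_c_agreement = pairwise_agreement(PREDICTED, rank_by_potency(toy_panel["line_C"]))
```

In this toy panel, two of the three lines reproduce the full ordering, and the third swaps a single adjacent pair. Exact-match share is the stricter of the two scores, which is why a figure near 90% across hundreds of lines is a meaningful statement about four drugs.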

That much is a clean result. But the part that makes it a generalization result rather than a retrieval result is the chemistry. Three of the four drugs (Carfilzomib, Nintedanib, and Infigratinib) had no structurally similar compound anywhere in the training set: every compound the model had seen scored a Tanimoto coefficient below 0.7 against them. Only Vemurafenib had a near neighbor, with moderate similarity (0.82) to PLX-4720, a BRAF inhibitor present in GDSC.

Sit with that distinction, because it is the one that separates a real signal from a flattering one.
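For readers who have not worked with it, the Tanimoto coefficient is the number of fingerprint bits two molecules share divided by the number of bits set in either. The sketch below shows that calculation and the 0.7 novelty cutoff in plain Python. The fingerprints are toy sets of bit indices, not real molecular fingerprints, and the essay does not say which fingerprint type the authors used, so treat every name and value here as an assumption.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_training_neighbor(query_fp, training_fps):
    """Return (name, similarity) for the most similar training compound."""
    name = max(training_fps, key=lambda n: tanimoto(query_fp, training_fps[n]))
    return name, tanimoto(query_fp, training_fps[name])


def is_structurally_novel(query_fp, training_fps, threshold=0.7):
    """True when no training compound reaches the similarity threshold."""
    _, best = nearest_training_neighbor(query_fp, training_fps)
    return best < threshold


# HYPOTHETICAL fingerprints: sets of on-bit indices, invented for illustration.
training = {
    "train_1": set(range(1, 10)),   # bits 1..9
    "train_2": {20, 21, 22, 23},
}
query_near = set(range(1, 12))      # bits 1..11, shares 9 bits with train_1
query_far = {1, 2, 30, 31, 32, 33}  # shares only 2 bits with train_1

near_name, near_sim = nearest_training_neighbor(query_near, training)
far_name, far_sim = nearest_training_neighbor(query_far, training)
```

Here `query_near` scores 9/11 against its closest neighbor and fails the novelty test, while `query_far` scores 2/13 and passes it. With a real cheminformatics toolkit the fingerprints would come from the molecular structures, but the ratio and the threshold check are the same.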

Interpolation proves nothing; extrapolation is the test

A model that gets the right answer only when the query sits near a training example has demonstrated a lookup table with good manners. This is the failure mode that makes so many "AI predicts X" papers hollow: the held-out test set is drawn from the same distribution as training, the novel examples are novel in label but not in structure, and the model is quietly interpolating between neighbors it already memorized. The performance is real in the narrow sense and meaningless in the sense that matters, because drug discovery is precisely the business of asking about molecules that are not like the ones you already have data for.

The three Tanimoto-distant drugs are the answer to that objection. Correctly ranking compounds with no close structural analog in training is extrapolation into chemistry the model was never shown. That is the thing you actually want a foundation model to do, and it is the thing benchmark tables systematically over-reward you for faking. When I read a comp-bio result, the first question is never "what was the metric" — it is "how far from the training manifold did the query sit." Here the honest answer is: far, on three of four, and the model still ordered them right.

That MAMMAL can do this at all is downstream of a specific architectural choice worth naming. Rather than binning or discretizing numerical values the way many sequence models do — which throws away quantitative precision exactly where you need it, in affinity and IC50 regression — MAMMAL projects native numbers into continuous embeddings through a learned layer. Potency is a continuous quantity. A model that quantizes it into buckets is structurally handicapped for ranking; one that keeps the numbers as numbers is not. The wet-lab ranking is, in part, that design decision cashing out.
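A toy contrast shows why bucketing hurts ranking. This is not MAMMAL's actual layer: the bucket edges, the drug names, and the fixed weights standing in for learned parameters are all hypothetical, chosen only to show what information each encoding keeps.

```python
def bin_token(value, edges):
    """Discretized encoding: index of the bucket the value falls into."""
    return sum(value >= edge for edge in edges)


def linear_embed(value, weight, bias):
    """Continuous encoding: project a scalar to a vector with a linear map."""
    return [w * value + b for w, b in zip(weight, bias)]


# HYPOTHETICAL log10(IC50) values for three made-up compounds.
log_ic50 = {"drug_x": -0.30, "drug_y": -0.10, "drug_z": 1.40}

# HYPOTHETICAL bucket boundaries for the discretized encoding.
edges = [-1.0, 0.0, 1.0]
tokens = {drug: bin_token(v, edges) for drug, v in log_ic50.items()}

# Fixed stand-ins for parameters a real model would learn during training.
weight, bias = [1.0, -0.5], [0.0, 0.2]
embeddings = {drug: linear_embed(v, weight, bias) for drug, v in log_ic50.items()}

# drug_x and drug_y share a bucket, so the token encoding cannot order them.
tied_after_binning = tokens["drug_x"] == tokens["drug_y"]
# The continuous encoding keeps them distinct and in the right order.
ordered_after_embedding = embeddings["drug_x"][0] < embeddings["drug_y"][0]
```

Once two potencies collapse into the same token, no downstream layer can recover which was lower. Finer buckets shrink the problem but do not remove it, whereas a continuous projection never creates the tie in the first place.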

Generation is cheap; falsification is the work

There is a deeper reason to privilege the assay over the leaderboard, and it is epistemic, not methodological. Producing predictions is easy. Any sufficiently large model will generate a ranking, a binding score, a structure, a plausible-looking answer for any input you hand it. Generation is not where the difficulty lives. The difficulty — the entire load-bearing act of science — is falsification against reality: constructing the one measurement that could have proven you wrong and finding out that it didn't.

This is the flaw in the fantasy of the fully automated discovery engine, which I've argued elsewhere is a category error. A model that proposes hypotheses is doing the cheap half. The expensive half is the wet lab, the 72-hour incubation, the physical world declining to care what your loss curve looked like. MAMMAL's benchmark wins are the cheap half done well. The Carfilzomib ranking is the rare case where the expensive half was actually run and the prediction survived it. One assay is not many, but one real falsification test passed outranks a page of numbers that were never exposed to the possibility of being wrong.

Which is also the honest way to read the eleven benchmarks. MAMMAL is state-of-the-art on nine of them and competitive on two, and some of those margins are genuinely large — a 28.5% relative jump on protein-protein interaction ddG (SKEMPI S1131, Pearson 0.663 to 0.852), sequence-only, landing within 1.6% of the best structure-based method at 0.866. That is a real result and I don't want to wave it away. But every one of those numbers lives on the interpolation-flattered side of the ledger: fine-tuned to the task, scored against a fixed test set, compared only to models that publicly reported. The benchmarks tell you the model is well-built. They cannot tell you it generalizes to reality, because the ground truth they measure against is itself a static, pre-collected artifact. That gap — between what a benchmark certifies and what a physical experiment certifies — is the actual bottleneck in AI drug discovery, and it is exactly the gap the wet-lab experiment steps across.
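The two percentages in that SKEMPI comparison can be checked directly from the three Pearson values quoted above. The sketch assumes the 28.5% is measured relative to the earlier 0.663 and the 1.6% relative to the structure-based 0.866, which is the reading under which both figures come out as stated.

```python
def relative_change(baseline, new):
    """Change from baseline to new, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


# Pearson correlations quoted in the essay for SKEMPI S1131.
previous = 0.663
mammal = 0.852
best_structure_based = 0.866

# Relative gain over the earlier result.
gain = relative_change(previous, mammal)

# Remaining shortfall, as a fraction of the structure-based score (assumed baseline).
gap = (best_structure_based - mammal) / best_structure_based

print(f"relative gain: {gain:.1%}")
print(f"gap to structure-based method: {gap:.1%}")
```

Note that the absolute gain is 0.189 in Pearson correlation. Quoting it as a relative change makes it sound larger than the raw difference, which is worth keeping in mind when reading any benchmark table.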

The same lens deflates the AlphaFold3 comparison that the paper leads with and that will get the most attention. Fine-tuned MAMMAL beat AF3 on 5 of 7 targets when AF3's confidence scores (pTM/ipTM) were used zero-shot as a binder-versus-non-binder proxy — 0.93 versus 0.45 AUROC on HER2, 1.00 versus 0.59 on CD206, and so on. Those numbers look devastating, but read the conditions: AF3 was applied zero-shot to a job it was never designed for (it is a structure predictor, not a binary binding classifier), MAMMAL was fine-tuned with explicit binder and non-binder examples, and the HER2 test set was downsampled to 60 pairs because AF3 is expensive to run. The two tied on TNFalpha, and AF3 actually won on the rigid globular target TBG (MAMMAL 0.63 to AF3's 1.00), which fits the paper's own mechanistic story — sequence models capture the statistical properties of intrinsically disordered regions, which make up 30-40% of the human proteome and which a single-conformation structure model handles poorly. It is an interesting exploratory comparison, and it takes nothing away from AF3, whose development contributed to a Nobel Prize. It is simply not the evidence. The evidence ran in a lab.
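AUROC is easier to reason about when you see what it computes: the probability that a randomly chosen binder gets a higher score than a randomly chosen non-binder. The sketch below implements that definition directly and applies it to two made-up score lists, one standing in for a fine-tuned classifier and one for a confidence value repurposed as a binding score. All scores and labels are hypothetical and are not taken from the paper.

```python
def auroc(scores, labels):
    """Probability a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one binder and one non-binder")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# HYPOTHETICAL test set: 1 = binder, 0 = non-binder.
labels = [1, 1, 1, 0, 0, 0]

# HYPOTHETICAL scores from a model trained on binder/non-binder examples.
classifier_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# HYPOTHETICAL confidence values reused as a proxy score, with no task training.
proxy_scores = [0.6, 0.4, 0.5, 0.5, 0.45, 0.7]

classifier_auroc = auroc(classifier_scores, labels)
proxy_auroc = auroc(proxy_scores, labels)
```

With only three binders and three non-binders there are nine pairs in total, so a single reordered pair moves the score by about 0.11. That granularity is the reason small or downsampled test sets call for caution when comparing AUROC values.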

The calibrating adult in the room

Now the part where I have to be the physician, not the enthusiast, because the most exciting thread in this paper is also the one most easily oversold.

Carfilzomib is a proteasome inhibitor. It is approved for multiple myeloma — a hematological cancer — and it has limited efficacy in solid tumors. MAMMAL predicted it as the most potent of the four across a panel of solid-tumor cell lines. If that prediction pointed at something real, it would be a repurposing signal: a drug reaching indications it currently doesn't serve. The authors flag it exactly this way, as a hypothesis that warrants further investigation. That is the correct register, and it is worth holding the line on it against the version of this story that a press release would write.

So draw the gap deliberately, because the reader should feel it. Cell lines are not patients. Immortalized cells in a well do not have a vasculature, an immune system, a tumor microenvironment, or a pharmacokinetic profile deciding whether the drug ever reaches the target at a tolerable dose. In-vitro potency is not clinical efficacy; the graveyard of oncology is full of compounds that killed cells beautifully in a dish and did nothing survivable in a person. And one assay is one assay — a confirmed ranking on the tested cell lines, extended computationally to the rest, is a strong early-pipeline signal and nothing more. The honest response to the Carfilzomib prediction is to fund the next experiment, not to announce a therapy.

Both things are true at once, and the discipline is entirely in holding them together. This is the most exciting result in the paper and it is a hypothesis. MAMMAL predicted potency for structurally novel drugs and the wet lab confirmed the ordering — that is a genuine, non-trivial demonstration that the model extrapolates. The Carfilzomib-across-solid-tumors signal is real enough to chase and nowhere near proven enough to believe. A researcher who can only feel one of those at a time — pure hype or pure dismissal — is not doing science; they are doing marketing or its cynical twin.

The model itself deserves the same calibration. MAMMAL is a well-engineered, genuinely open contribution — 458 million parameters, released weights, one unified sequence-to-sequence framework spanning small molecules, protein sequences, and gene-expression rankings, pretrained on two billion samples. It is sequence-only, it does not model 3D structure, and its authors say so plainly. It advances the tooling. What it does not do is collapse the distance between a prediction and a truth.

That distance is the entire game. The leaderboard is where you prove your model is smart. The wet lab is where reality gets a vote — and this time, on chemistry the model had never seen, reality voted yes. Everything else in the paper is a reason to run more experiments. That one result is a reason to believe the experiments are worth running.

Frequently asked questions

What did the MAMMAL wet-lab experiment actually confirm?
MAMMAL predicted a relative potency ranking for four drugs held out of its GDSC training data: Carfilzomib > Nintedanib > Infigratinib > Vemurafenib. A wet-lab assay using the same protocol as GDSC (CellTiter-Glo viability after 72-hour drug incubation, IC50 fit in Prism) confirmed the exact predicted ordering on the tested cell lines. Extended computationally across all 805 GDSC cell lines, the model preserved that ordering in roughly 90-95% of cases.

Why does the structurally novel part matter so much?
Three of the four drugs (Carfilzomib, Nintedanib, Infigratinib) had no structurally similar compound in the training set (Tanimoto coefficient < 0.7). That makes the correct ranking an extrapolation to novel chemistry rather than interpolation near a known training neighbor. A model that only reproduces answers close to its training data has demonstrated memory, not generalization; getting the order right on unseen scaffolds is the meaningful test.

Does this mean Carfilzomib works against solid tumors?
No. Carfilzomib is a proteasome inhibitor approved only for hematological malignancies (multiple myeloma) with limited efficacy in solid tumors. MAMMAL predicting it as most potent across solid-tumor cell lines is a repurposing hypothesis the authors say warrants further investigation, not a proven therapy. Cell lines are not patients and in-vitro potency is not clinical efficacy.

How does MAMMAL relate to the AlphaFold3 comparison in the paper?
Separately from the wet-lab work, the authors compared fine-tuned MAMMAL against AlphaFold3 confidence scores used zero-shot as a binder-versus-non-binder proxy, and MAMMAL was better on 5 of 7 targets. That comparison is exploratory (small, downsampled test sets; AF3 was never designed as a binary binding classifier; AF3 won on TBG and tied on TNFalpha) and is not the paper's strongest evidence. The wet-lab ranking is.

What is MAMMAL and is it available?
MAMMAL is a cross-modal biomedical foundation model from IBM Research and Technion, published in npj Drug Discovery (2026). It unifies small molecules (SMILES), proteins and antibodies (amino-acid sequences, no 3D structure), and gene expression (ranked gene lists) in one sequence-to-sequence framework. The released 458M-parameter model (ibm/biomed.omics.bl.sm.ma-ted-458m) is open-source on Hugging Face and GitHub.

Filed under Cross-Disciplinary Deep Essays. Where biology, computation, markets, and philosophy collide.
