One Language for Proteins, Molecules, and Cells: The MAMMAL Bet

MAMMAL's real contribution is not a benchmark win. It's a bet that molecules, proteins, and gene expression can share one sequence-to-sequence language — and a 458M-parameter generalist that proves the bet pays.

By Mehdi · 8 min read

MAMMAL's real contribution is not that it topped a benchmark table. It is the wager underneath the table: that three biomedical modalities as physically different as small molecules, proteins, and a cell's transcriptional state can be written in one sequence-to-sequence language — and that a single 458-million-parameter generalist trained across all three can match or beat the specialized models built for each task individually.

IBM Research and Technion published that model, MAMMAL, in npj Drug Discovery (2026, vol 3, article 14). It reached state-of-the-art on 9 of 11 drug-discovery benchmarks and stayed competitive on the other two. The headline reads like every other foundation-model paper. Look past it: the unification is the story, not the leaderboard.

That is a categorically different bet from the one the frontier labs are making, and it is not a bet on scale. The released model, ibm/biomed.omics.bl.sm.ma-ted-458m, is small and openly available on Hugging Face and GitHub. Its wins came from representation, alignment, and task design.

Three modalities, one sequence

Start with why this is hard. A small molecule is a graph of atoms and bonds. A protein is a linear polymer of amino acids that folds into a specific shape. Gene expression is a vector of counts over ~20,000 genes describing what a cell is doing right now. These live in different mathematical worlds, and the field has historically built a different model for each — a graph neural net for molecules, a protein language model for sequences, a specialized encoder for single-cell data.

MAMMAL forces them into a common grammar. Small molecules go in as SMILES strings. Proteins and antibodies go in as raw amino-acid sequences — sequence-only, no 3D structure. Gene expression goes in as a ranked list of gene names, ordered from most to least expressed. Everything becomes a token stream, and the model, built on the transformer / T5 lineage, is trained sequence-to-sequence across all of it: 2 billion samples drawn from six public datasets (OAS antibodies, UniProt/UniRef90 proteins, ZINC and PubChem molecules, STRING protein-protein interactions, CELLxGENE single-cell), spanning three domains and seven pretraining tasks.
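
The gene-expression encoding described above is easy to picture in code. The following is a minimal sketch, assuming the input is a mapping from gene name to count. The function name `rank_genes`, the `top_k` cutoff, the dropping of zero-count genes, and the alphabetical tie-break are all assumptions made for illustration, not details taken from the MAMMAL paper.

```python
def rank_genes(expression, top_k=None):
    """Return gene names ordered from most to least expressed.

    Assumptions (not from the source): zero-count genes are dropped,
    ties are broken alphabetically, and top_k optionally truncates.
    """
    expressed = [(gene, count) for gene, count in expression.items() if count > 0]
    # Sort by descending count, then by gene name for a deterministic order.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in expressed]
    return genes if top_k is None else genes[:top_k]


# Toy counts for one cell; the values are made up for illustration.
cell = {"CD3E": 12, "MS4A1": 0, "LYZ": 40, "NKG7": 3, "CD79A": 3}
ranked = rank_genes(cell)
print(ranked)  # ['LYZ', 'CD3E', 'CD79A', 'NKG7']
```

The point of the encoding is that the count vector disappears: what reaches the model is an ordered list of names, which is already a token stream.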

Two mechanisms make this cohere, and both are underrated relative to the benchmark table.

Mechanism one: a prompt syntax that lets entities interact

The first is a structured, multi-domain prompt syntax paired with a modular tokenizer — sub-tokenizers per entity type, so an amino-acid sequence and a SMILES string are each tokenized in their own alphabet but assembled into one prompt. This is what turns "three models in a trench coat" into an actual multi-alignment framework. Because a drug and a target protein can occupy the same prompt, the model can be asked about their interaction — drug-target binding, or the effect of a point mutation on binding free energy — rather than embedding each entity in isolation and stapling the vectors together afterward.
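
To make "sub-tokenizers assembled into one prompt" concrete, here is a toy sketch. It is not MAMMAL's actual prompt syntax or tokenizer: the tags `<PROTEIN>` and `<SMILES>`, the function names, and the regex-based SMILES splitting are all hypothetical stand-ins chosen to show the shape of the idea.

```python
import re

# Hypothetical entity tags; the real MAMMAL prompt syntax is not reproduced here.
PROTEIN_TAG = "<PROTEIN>"
SMILES_TAG = "<SMILES>"

# Bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_protein(sequence):
    """One token per amino-acid letter."""
    return list(sequence)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, and ring tokens."""
    return SMILES_PATTERN.findall(smiles)


# Each entity type gets its own sub-tokenizer, selected by tag.
SUB_TOKENIZERS = {PROTEIN_TAG: tokenize_protein, SMILES_TAG: tokenize_smiles}


def build_prompt(entities):
    """Concatenate tagged entities into a single token stream."""
    tokens = []
    for tag, text in entities:
        tokens.append(tag)
        tokens.extend(SUB_TOKENIZERS[tag](text))
    return tokens


# A short made-up peptide and acetyl chloride share one prompt.
prompt = build_prompt([(PROTEIN_TAG, "MKTAY"), (SMILES_TAG, "CC(=O)Cl")])
print(prompt)
```

Note that `Cl` comes out as a single chlorine token rather than a carbon followed by a stray letter, which is the kind of thing a per-alphabet tokenizer exists to get right. A real implementation would also map the amino acid `C` and the carbon `C` to different token IDs; this sketch leaves that to the tag that precedes them.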

This matters because biology is relational. Nothing in drug discovery is about a molecule alone; it's about a molecule against a target, a mutation in a complex, an antibody at an epitope. A framework where heterogeneous entities share a sequence and attend to each other is a better structural match for the questions than a zoo of single-modality encoders.

Mechanism two: numbers as numbers

The second mechanism is the one I'd point a technical reader to first, because it's the kind of detail that gets buried and shouldn't be. MAMMAL feeds native numerical values into the model as continuous embeddings, projecting a real number directly into the embedding space through a learned projection layer — instead of binning or discretizing it.

Here is why that is not a minor engineering choice. The default trick for getting a number into a language model is to bucket it: turn a binding affinity into "low / medium / high," or carve a continuous range into 32 bins and emit a token per bin. That is lossy in exactly the place a pharmacologist cares about. Binding affinity and IC50 span orders of magnitude, and the decisions that matter often live inside a bin. A 10 nM binder and a 40 nM binder can be the same drug program's success or failure, and a coarse binning collapses them into the same token. You have discarded the signal before the model ever sees it. For genuine regression — dose-response, affinity, IC50 — discretization caps your ceiling no matter how good the rest of the architecture is. Projecting the raw value preserves the quantitative precision the task is actually about. It's a small layer with an outsized effect on which problems the model can even be honest about.
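
The collapse described above is easy to reproduce. The sketch below assumes 32 equal-width bins over a 0 to 10,000 nM range, and it uses a fixed affine map as a stand-in for a learned projection layer. The bin range, the weights, and the four-dimensional embedding are illustrative assumptions, not MAMMAL's parameters.

```python
def bin_index(value_nm, low=0.0, high=10_000.0, n_bins=32):
    """Equal-width binning: map a value to one of n_bins token indices."""
    clipped = min(max(value_nm, low), high)
    width = (high - low) / n_bins
    return min(int((clipped - low) / width), n_bins - 1)


def project(value, weights, bias):
    """Scalar-to-vector affine map, standing in for a learned projection layer."""
    return [w * value + b for w, b in zip(weights, bias)]


# Placeholder parameters; in a trained model these would be learned.
WEIGHTS = [0.5, -0.25, 1.0, 0.125]
BIAS = [0.0, 1.0, -1.0, 0.5]

strong, weak = 10.0, 40.0  # binding affinities in nM

# Both values fall in the first 312.5 nM-wide bin, so they become the same token.
print(bin_index(strong), bin_index(weak))  # 0 0

# The projected vectors differ, so the model can still tell the two apart.
print(project(strong, WEIGHTS, BIAS))
print(project(weak, WEIGHTS, BIAS))
```

Log-scale bins would separate this particular pair, but the same loss reappears for any two values that share a bin. The projection has no such boundary.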

Reading the results honestly

Nine of eleven at state-of-the-art invites a healthy reflex to discount it. So read the texture. Most of the wins are modest — the kind of +2% to +4% that is real but not revelatory. "Outperform" here means a relative improvement above 1%, and each result is a fine-tune of the pretrained model compared against the specialized SOTA that publicly reports on that benchmark.

| Task | Metric | SOTA → MAMMAL | Relative Δ |
|---|---|---|---|
| Cell type annotation (Zheng68k PBMC) | F1 | 0.710 → 0.763 | +7.5% |
| BBBP (blood-brain-barrier) | AUROC | 0.937 → 0.957 | +2.2% |
| ClinTox (clinical toxicity) | AUROC | 0.948 → 0.986 | +4.0% |
| Cancer-drug response (GDSC1) | Pearson | 0.887 → 0.917 | +3.4% |
| Cancer-drug response (GDSC2) | Pearson | 0.900 → 0.931 | +3.4% |
| Cancer-drug response 3 | Pearson | 0.923 → 0.928 | +0.5% (competitive) |
| Antibody CDRH3 infilling | AA recovery | 0.375 → 0.446 | +19% |
| Antibody-antigen binding (HER2) | AUROC | 0.924 → 0.928 | +0.4% |
| TCR-epitope binding | AUROC | 0.862 → 0.879 | +2.0% |
| PPI ddG (SKEMPI S1131) | Pearson | 0.663 → 0.852 | +28.5% |
| Drug-target interaction (BindingDB) | NRMSE ↓ | 0.942 → 0.906 | +3.8% |
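
The Δ column and the 1% threshold can be recomputed from the reported scores. This sketch assumes the relative improvement is the change divided by the baseline, with the sign flipped for NRMSE, where lower is better. Because it works from the rounded scores in the table, a recomputed value can differ from the reported column in the last digit.

```python
# (task, SOTA score, MAMMAL score, higher_is_better), copied from the results table.
RESULTS = [
    ("Cell type annotation", 0.710, 0.763, True),
    ("BBBP", 0.937, 0.957, True),
    ("ClinTox", 0.948, 0.986, True),
    ("Cancer-drug response (GDSC1)", 0.887, 0.917, True),
    ("Cancer-drug response (GDSC2)", 0.900, 0.931, True),
    ("Cancer-drug response 3", 0.923, 0.928, True),
    ("Antibody CDRH3 infilling", 0.375, 0.446, True),
    ("Antibody-antigen binding (HER2)", 0.924, 0.928, True),
    ("TCR-epitope binding", 0.862, 0.879, True),
    ("PPI ddG (SKEMPI S1131)", 0.663, 0.852, True),
    ("Drug-target interaction", 0.942, 0.906, False),  # NRMSE: lower is better
]

THRESHOLD = 0.01  # "outperform" means a relative improvement above 1%


def relative_improvement(sota, mammal, higher_is_better):
    """Improvement over the baseline as a fraction of the baseline score."""
    delta = mammal - sota if higher_is_better else sota - mammal
    return delta / sota


wins = 0
for task, sota, mammal, higher in RESULTS:
    gain = relative_improvement(sota, mammal, higher)
    label = "outperform" if gain > THRESHOLD else "competitive"
    if gain > THRESHOLD:
        wins += 1
    print(f"{task:32s} {gain:+.1%}  {label}")

print(f"{wins} of {len(RESULTS)} above the 1% threshold")
```

Run as written, the count comes out at 9 of 11, with the third cancer-drug response benchmark and HER2 binding as the two results below the threshold.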

Two rows are doing the real argumentative work, and they are the ones that test whether cross-modal pretraining actually transfers rather than just tying the incumbent.

The first is protein-protein interaction ddG on SKEMPI S1131 — predicting how a mutation changes binding free energy. MAMMAL moves Pearson from 0.663 to 0.852, a +28.5% jump, and it does this sequence-only, coming within 1.6% of the best structure-based method (0.866). Mutation effects on binding are conventionally a structural problem; you want to see the interface, the packing, the hydrogen bonds a substitution breaks. That a model with no 3D input lands a hair under the structure-based ceiling says the shared representation learned something about interaction physics from sequence statistics alone. That is not a task-specific tweak. That is transfer.

The second is antibody CDRH3 infilling at +19% (amino-acid recovery 0.375 to 0.446). CDRH3 is the most variable, most functionally decisive loop in an antibody — the hardest part to design and the part that most determines what the antibody binds. A generalist beating specialists on the hardest sub-problem in antibody design is the kind of result that only shows up when the representation is genuinely good, not when you've overfit an easy benchmark.

I'd also flag one of the two places MAMMAL is merely competitive rather than dominant, because it's revealing: antibody-antigen binding on HER2 mutants, +0.4% (0.924 → 0.928), where the SOTA it narrowly edges used structural data and MAMMAL did not. Drawing even with a structure model while blind to structure is, in context, a stronger result than the small delta suggests.

The exploratory AlphaFold3 comparison sits alongside this: fine-tuned MAMMAL out-discriminated AF3's zero-shot confidence scores, used as a binder/non-binder proxy, on 5 of 7 targets. That deserves its own treatment, and I've written it up separately in "Why a sequence-only model out-predicted AlphaFold3 on antibody binding." The short version: MAMMAL is sequence-only and AF3 builds full 3D structures, so this is not a claim that MAMMAL is a better structure model; AF3 was never trained as a binary binding classifier, the test sets are small, and AF3 still wins on the rigid target and ties on another. Read it there.

What the wet lab adds

Benchmarks are retrospective. The more interesting evidence is a small prospective test the authors ran. MAMMAL predicted the relative potency ranking of four drugs absent from the GDSC training data — Carfilzomib > Nintedanib > Infigratinib > Vemurafenib — and a wet-lab experiment using the GDSC protocol (CellTiter-Glo viability at 72 hours, IC50 by Prism) confirmed the exact ranking on the tested cell lines. Extended in silico across all 805 GDSC cell lines, the ordering held in roughly 90-95% of cases.
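
The in silico extension amounts to asking, per cell line, whether the predicted potency order matches the expected one. Here is a minimal sketch of that check, assuming a lower IC50 means higher potency and assuming "held" means an exact match of all four positions. The authors' exact criterion is not stated here. The cell-line names and IC50 values are made-up placeholders, not data from the paper.

```python
# Expected order, most to least potent, as reported in the article.
EXPECTED = ["Carfilzomib", "Nintedanib", "Infigratinib", "Vemurafenib"]

# Made-up predicted IC50 values (micromolar) for three placeholder cell lines.
predictions = {
    "line_A": {"Carfilzomib": 0.01, "Nintedanib": 1.2, "Infigratinib": 3.5, "Vemurafenib": 20.0},
    "line_B": {"Carfilzomib": 0.02, "Nintedanib": 0.9, "Infigratinib": 2.0, "Vemurafenib": 15.0},
    "line_C": {"Carfilzomib": 0.05, "Nintedanib": 4.0, "Infigratinib": 2.5, "Vemurafenib": 30.0},
}


def potency_order(ic50_by_drug):
    """Drugs sorted from most to least potent (ascending IC50)."""
    return sorted(ic50_by_drug, key=ic50_by_drug.get)


def agreement_rate(predictions_by_line, expected):
    """Fraction of cell lines whose predicted order equals the expected order."""
    matches = sum(potency_order(ic50s) == expected for ic50s in predictions_by_line.values())
    return matches / len(predictions_by_line)


rate = agreement_rate(predictions, EXPECTED)
print(f"ordering held in {rate:.0%} of placeholder cell lines")
```

In the placeholder data, `line_C` swaps Nintedanib and Infigratinib, so two of three lines agree. An exact-match criterion is strict: a single adjacent swap counts as a miss.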

The detail that makes this more than a lucky call: three of the four drugs had no structurally similar compound in the training set (Tanimoto < 0.7). Only Vemurafenib had a close analog (0.82 to PLX-4720, a BRAF inhibitor in GDSC). The model wasn't pattern-matching to a near-neighbor for the novel compounds. One result is worth flagging as a hypothesis, not a finding: Carfilzomib, a proteasome inhibitor approved only for hematological malignancies with limited solid-tumor efficacy, was predicted most potent across diverse solid-tumor lines. The authors call this a repurposing hypothesis warranting investigation — not a therapy. That's the right amount of confidence.
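
For readers who have not met it, Tanimoto similarity is the size of the intersection of two fingerprint bit sets divided by the size of their union. The sketch below uses hand-written bit sets as hypothetical fingerprints. Which fingerprint the authors used is not stated here, and the 0.7 cutoff is the one quoted above.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / len(union)


CLOSE_ANALOG_CUTOFF = 0.7  # threshold quoted in the article

# Hypothetical fingerprints: indices of bits that are set.
query = {1, 4, 7, 9, 15, 22}
near_neighbor = {1, 4, 7, 9, 15, 30}   # shares 5 of 7 distinct bits
distant = {2, 4, 9, 40, 41, 42}        # shares 2 of 10 distinct bits

for name, candidate in [("near_neighbor", near_neighbor), ("distant", distant)]:
    similarity = tanimoto(query, candidate)
    has_analog = similarity >= CLOSE_ANALOG_CUTOFF
    print(f"{name}: {similarity:.2f}, close analog: {has_analog}")
```

Under this definition, a drug counts as novel relative to the training set when its best match across all training compounds stays below the cutoff.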

The strategic read

Here is my interpretation, marked as such. MAMMAL is a data point against the assumption that only scale advances capability. It is 458M parameters — small by any 2026 standard — and open. Its wins did not come from a bigger pretraining run than everyone else's; they came from putting heterogeneous data in one representation, aligning modalities so they can interact, feeding numbers in without lobotomizing them, and choosing tasks where transfer could show. That is an argument for composition and design over brute force, and it lines up with the case for small, composable, boring AI: a modular generalist you can fine-tune and drop into an agentic workflow may be worth more, in a real discovery pipeline, than a frontier model you can neither inspect nor deploy.

The stakes justify the ambition. Around 90% of drug candidates that enter clinical trials never reach regulatory approval, and many of those failures come late and at great expense. Better early-stage discrimination (is this molecule likely to bind, to penetrate, to be toxic, to be potent?) is where compute buys the most leverage, because it kills bad programs before they consume a decade. Unifying modalities is also a concrete step toward the "virtual cell": one representation in which a perturbation, a protein, and a transcriptional response are the same kind of object.

The authors are clear-eyed about the ceiling, and so should we be. MAMMAL is sequence-only; it does not explicitly model 3D structure, and its benchmark and cell-line results are early-pipeline signals, not clinical efficacy. The open problem they name is the honest one: how to keep the simplicity of a shared sequence-to-sequence language while borrowing the expressive power of specialized structure and diffusion models. Nobody has that yet.

MAMMAL doesn't resolve that tension. It does something more useful for the field's next five years — it shows the unified-language side of the bet has a lot more room than the specialists assumed.

Frequently asked questions

What is MAMMAL and who built it?
MAMMAL (Molecular Aligned Multi-Modal Architecture and Language) is a cross-modal biomedical foundation model for drug discovery from IBM Research (Israel and T.J. Watson) and Technion, published in npj Drug Discovery 2026 (vol 3, article 14). The released model, ibm/biomed.omics.bl.sm.ma-ted-458m, has 458M parameters and is open-source on Hugging Face and GitHub (BiomedSciAI/biomed-multi-alignment).

What does it mean that MAMMAL uses "one language" for three modalities?
MAMMAL represents small molecules (as SMILES strings), proteins and antibodies (as amino-acid sequences, with no 3D structure), and gene expression (as ranked lists of gene names) inside a single sequence-to-sequence framework. A structured multi-domain prompt syntax with a modular tokenizer lets these heterogeneous entities live in one sequence and interact, so tasks like drug-target binding and mutation-effect prediction are expressed in the same format.

Why is projecting numbers as continuous embeddings a big deal?
Most sequence models bin continuous values like binding affinity or IC50 into discrete buckets, which throws away the fine differences a pharmacologist relies on: a 10 nM and a 40 nM binder can land in the same bin. MAMMAL projects real numbers directly into the embedding space through a learned projection layer, preserving the quantitative precision that regression tasks need.

What were MAMMAL's most striking results?
On 11 benchmark tasks it reached state-of-the-art on 9 and was competitive on 2. The two standouts: protein-protein interaction ddG (SKEMPI S1131) improved +28.5% (Pearson 0.663 to 0.852) sequence-only, within 1.6% of the best structure-based method (0.866); and antibody CDRH3 infilling improved +19% (amino-acid recovery 0.375 to 0.446). Most other wins were a more modest +2-4%.

Does MAMMAL replace structure-based models like AlphaFold3?
No. MAMMAL is sequence-only and does not model 3D structure. Its benchmark and wet-lab results are early-pipeline signals, not clinical efficacy. The authors name the open challenge directly: combining the simplicity of sequence-to-sequence modeling with the expressive power of specialized structure and diffusion models.
