The best mental model I have for where AI agents help in research is not "hard versus easy." It's "how fast and how cheap is the check." Anything where the loop closes in seconds and for free — code runs or it throws, a query returns rows or it errors, a proof type-checks or it doesn't — an agent compresses from hours to minutes. Anything gated by a wet-lab experiment that takes three weeks and a few thousand dollars of reagent to tell you whether you were right, the agent cannot touch, and no amount of model capability changes that. The dividing line runs straight through my week, and it does not track difficulty at all.
I want to give an honest ledger, because most accounts of this are written by people selling something. I run computational-biology work — epigenetic aging clocks, lncRNA, causal inference on microbiome data — and I ship software on the side. I use these tools every day, in both halves of that life. They have genuinely changed how I work. They have also left the part of science that actually costs time and money exactly where it was.
Where the loop is fast, agents are already indispensable
Start with the surprising one: literature synthesis. Not summarizing a single paper, which is a parlor trick, but holding more of a field in working memory than I can. When I open a new problem I want to know what has already been tried, what failed quietly, which effect sizes replicated and which evaporated. A human postdoc reads maybe fifteen papers deeply before pattern-matching from memory and missing things. An agent will read fifty, cross-reference the methods sections, and flag where two papers used the same term for different constructs. The check is cheap: I can spot-verify a claim against the source in thirty seconds. Fast loop, so it works.
Second: enumerating hypotheses and, more valuably, alternative explanations. This is the one I would defend hardest. The default failure mode of a working scientist is confirmation bias — you get a result you like and you stop looking. When I see a clean signal in aging-clock data, the responsible next move is to list every boring reason it might be an artifact before I let myself believe it. Batch effect: were the young and old samples run on different plates, different days, different technicians? Cell-composition confounding: is my "epigenetic age" signal really just a shift in the fraction of immune cell types in the blood? Technical drift in the array? An agent generates that list faster and more completely than I will at 6pm on a Friday when I want the result to be real. It does not know which explanation is true. It is very good at making sure I rule each one out on purpose rather than by omission.
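The first item on that list, batch confounding, is mechanical enough to check in code. Here is a minimal sketch, assuming a hypothetical sample sheet in which each sample carries an `age_group` and a `plate` field (the field names, the function name, and the data are all made up for illustration). It flags plates that hold only one age group, which is the layout where a plate effect and an age effect cannot be told apart.

```python
from collections import defaultdict

def single_group_batches(samples, group_key="age_group", batch_key="plate"):
    """Return the batches whose samples all belong to a single group.

    If every batch shows up here, group and batch are fully confounded:
    no downstream analysis can separate a batch effect from a group effect.
    """
    groups_per_batch = defaultdict(set)
    for sample in samples:
        groups_per_batch[sample[batch_key]].add(sample[group_key])
    return sorted(b for b, groups in groups_per_batch.items() if len(groups) == 1)

# Hypothetical sample sheet; names and values are illustrative only.
sample_sheet = [
    {"id": "s1", "age_group": "young", "plate": "P1"},
    {"id": "s2", "age_group": "young", "plate": "P1"},
    {"id": "s3", "age_group": "old", "plate": "P2"},
    {"id": "s4", "age_group": "old", "plate": "P2"},
    {"id": "s5", "age_group": "young", "plate": "P3"},
    {"id": "s6", "age_group": "old", "plate": "P3"},
]

flagged = single_group_batches(sample_sheet)
print(flagged)  # ['P1', 'P2']
```

An empty result does not mean there is no batch effect. It only means the layout leaves room to estimate one, which is the precondition for ruling it out on purpose.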
Third, and least surprising: code. Writing and debugging analysis pipelines is where the fast-loop advantage is most naked. The verifier is built in: the script runs or it doesn't, the test passes or it doesn't, the plot renders or the call throws. An agent iterates against that signal hundreds of times without me. Data wrangling, reshaping, format conversion, the tedious glue between one tool's output and another tool's input: this used to eat a genuine fraction of every project and now largely doesn't. Catching statistical errors belongs here too. "You're doing a t-test on data that's clearly not normal." "You have 40,000 features and 60 samples and no multiple-testing correction; your p-values are decorative." "This is pseudoreplication; your n is the number of animals, not the number of cells." A reviewer catches these in two months. The agent catches them today, before I've wasted the two months.
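The multiple-testing complaint is easy to make concrete. The sketch below is an illustration under stated assumptions, not anyone's production pipeline: it builds 40,000 idealized null p-values by placing them at the expected quantiles of a uniform distribution (a deterministic stand-in for a dataset with no real signal), counts how many clear the naive 0.05 bar, and then applies the Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` is my own; in practice you would likely reach for a library implementation.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k smallest.
    """
    m = len(p_values)
    rejected = 0
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= k * alpha / m:
            rejected = k  # keep the largest rank that passes
    return rejected

m = 40_000
# Idealized null: p-values evenly spread over (0, 1), i.e. no true effects.
null_p = [(i + 0.5) / m for i in range(m)]

naive_hits = sum(p < 0.05 for p in null_p)
bh_hits = benjamini_hochberg(null_p, alpha=0.05)
print(naive_hits, bh_hits)  # 2000 0
```

Two thousand "significant" features from data that contains nothing, and none after correction: that is what decorative means.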
Fourth: drafting and critiquing experimental designs — but only the paper part of the design. Sample-size logic, the shape of the controls, whether the design can actually distinguish the two hypotheses or is confounded from birth, what the analysis plan should be. This is reasoning over a specification, and specifications have fast checks. You can find the flaw in a design by thinking, in an afternoon, for free. You cannot find the flaw in the biology that way.
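Sample-size logic is a good example of a design question that closes in seconds. This is a rough sketch using the standard normal approximation for a two-sided, two-sample comparison of means, where the per-group n is 2(z_{1-α/2} + z_{power})² / d² and d is the standardized effect size. The function name and the defaults are illustrative assumptions, and the approximation runs slightly below what a t-based calculation gives, so treat the output as a floor.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate samples per group for a two-sided two-sample comparison.

    effect_size is the standardized mean difference (Cohen's d).
    Uses the normal approximation, so it slightly underestimates the
    sample size a t-test-based calculation would give.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

print(n_per_group(0.5))  # 63
```

Because n scales with 1/d², halving the effect you hope to detect roughly quadruples the animals you need. That is the kind of flaw you can find in a design for free, before anything is ordered.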
The thread through all four: the check is fast and cheap. That is the entire property. It is why the same class of model that feels miraculous in my terminal feels inert at my bench.
Where the loop is slow, agents stall — and it isn't about intelligence
Here is the part the hype skips. Every genuine acceleration above lives upstream of the actual constraint. In experimental biology the rate-limiting step is the causal loop: to learn whether an intervention does what you think, you have to perturb a living system and wait for it to answer. That answer takes weeks. It costs money and irreplaceable sample. It comes back noisy, so you often have to do it again. No agent shortens that loop, because the loop is made of cell-division time and reagent cost and the physical world's refusal to be queried faster than it runs.
This is why a system that enumerates a hundred plausible next experiments does not help me as much as it looks like it should. Generating directions was never my bottleneck. Choosing which one of the hundred is worth a month of a postdoc's life is the bottleneck, and it is a judgment made under deep uncertainty where the agent has no privileged information. It can tell me which experiments are logically informative. It cannot tell me which will work, because "will work" depends on tacit facts: that this antibody is unreliable below a certain concentration, that this cell line drifts after passage twenty, that the assay everyone cites in the methods section actually needs a tweak nobody published to give clean data. That knowledge lives in the hands of people who have run the protocol two hundred times. It is not in the training data because it was never written down. Bench science is full of it, and it is exactly the part that separates a result that replicates from one that doesn't.
There's a subtler failure too. The agent's fluency is calibrated to the fast-loop tasks, and it carries the same confidence into the slow-loop ones, where confidence is unearned. It will propose an elegant experiment with a critical practical flaw — a readout confounded by the very manipulation you're using — and state it in exactly the tone it used for the code that ran on the first try. In the terminal, wrong is cheap; you see the traceback in a second. At the bench, that same wrongness costs three weeks and you find out at the end. The asymmetry in the cost of being wrong is the whole game, and the agent doesn't feel it.
So the honest ledger reads: agents have compressed everything around the experiment and left the experiment itself untouched. The shape of my week has changed. I spend far less time on code and lit review, and far more of the freed time on the thing that was always the real work: deciding what is worth running. The bottleneck didn't vanish. It got more concentrated, and more clearly the human's.
The discipline that turns this from a toy into leverage
The people getting real value from these tools in research are not the ones with the cleverest prompts. They're the ones who can say precisely what they want checked. "Make this better" gets you nothing, at the bench or in the code. "Here is my design, here are the two hypotheses it must distinguish, here are the confounds I already know about — find the one I've missed" gets you something that can save a month. The skill is writing the specification tightly enough that a fast checker can act on it, which is why problem specification is the capability that outlasts prompt engineering: the value was never in the phrasing, it was in knowing exactly what "correct" means for this task, precisely enough that a machine can test against it.
That reframes the whole discourse about automating science. The dream is a system that closes the loop end to end: reads the literature, forms the hypothesis, runs the experiment, updates, repeats. But the two ends of that loop are not the same kind of thing. The reading-and-reasoning end has a fast, cheap check and is already substantially automatable. The running-and-learning end is bounded by physics and money and tacit skill, and treating those two as one continuous capability is a category error: the error of assuming that because generation got cheap, discovery did. Generation was never the scarce thing. The scarce thing is the slow, expensive contact with reality that tells you which of your beautiful generations is true.
I am, for the record, an optimist about this. The compression of the fast-loop work is real and it is large, and I would not give it back. The optimism has a precise shape. These tools make me a faster reasoner, a faster coder, and a much more paranoid checker of my own results. They do not make the universe answer my questions any faster. The rate limit on knowing something new about a living system is still set by how long it takes that system to tell you, and it is not taking calls.