Every clever prompt you have memorized is a bet against the next model release, and you are going to lose that bet. "Take a deep breath and think step by step," the elaborate role-play preambles, the threat-and-bribe scaffolding — each was a workaround for a specific model's specific weakness, and each gets quietly deleted from your repertoire the moment a better model ships. Prompt engineering is a depreciating asset. The skill that appreciates as models improve is the one input a model cannot supply for you: a precise specification of the problem. State the goal, the real constraints, the test that decides success, and the cost of being wrong, and you have given the model the only thing it genuinely cannot generate. Everything else it can now do better than you.
This is not a productivity tip. It is a claim about where the bottleneck is moving, and it should change what you practice.
The tricks were always temporary
Watch what actually happened over the last three model generations. Chain-of-thought prompting was a real discovery — you got materially better reasoning by forcing the model to externalize intermediate steps. Then reasoning models arrived that do it internally, and the instruction became noise or worse. The same fate met few-shot formatting hacks, the "you are a world-class expert" incantations, and most of the jailbreak-adjacent phrasing that circulated as folk wisdom. The pattern is not incidental. A prompt trick is, by definition, a lever that exploits a gap between what the model can do and what it does by default. Closing that gap is the entire objective of the labs shipping the models. You are optimizing against the roadmap of the most capitalized engineering effort in the industry, and you will not win.
Contrast that with what does not depreciate. If you can say exactly what decision an answer needs to serve, which constraints are physical law versus inherited assumption, how you will test whether the output is right, and what it costs you if it is subtly wrong — none of that gets erased by a better model. A stronger model makes a good specification more valuable, because it can now execute against a demanding spec that a weaker model would have botched no matter how well you wrote it. The better the engine, the more the quality of your destination matters. Precision of intent is the input with the longest half-life in the entire stack.
What a specification actually contains
I was trained as a scientist before I built software, and the discipline that transfers most directly is not statistics or coding. It is the construction of a falsifiable hypothesis. A hypothesis that cannot fail a pre-committed test is not a weak hypothesis; it is not a hypothesis at all. Popper's line is the whole trade: a claim earns scientific standing only by forbidding some observable outcome. In practice this means you commit, in advance, to the result that would make you abandon the idea. The moment you let yourself decide after the fact whether the data "basically" confirmed you, you have left science and entered storytelling. Most of the intellectual labor in a real study happens before any data is collected — in specifying the question sharply enough that the answer can hurt.
A prompt to an AI system is a hypothesis about what you want, and almost everyone skips the pre-commitment step. They type the question, receive fluent output, and then decide whether they like it. That is the storytelling failure mode, and it is why so much AI use produces plausible garbage: the output is graded against a target the user never defined, so anything articulate passes. A real specification has four parts, and each is a place where thinking either happened or got skipped.
The decision it serves. An answer is only as well-formed as the choice it feeds. "Summarize this market" is not a specification; "tell me which of these two segments to enter first, given that I can only staff one" is. The decision fixes the granularity, the tradeoffs that matter, and the facts that are load-bearing versus decorative. Most weak prompts are weak because the author has not decided what they will do differently depending on the answer — which usually means the answer does not matter, which means the whole exercise is theater. And once you know the decision, you often find a causal question hiding inside it: not "what correlates with churn" but "what would reduce churn if we changed it," a distinction I have argued is the silent failure point in most AI-assisted business decisions. The model will happily answer the correlational question you asked instead of the causal one you meant, and it looks identical on the page.
Real constraints versus assumed ones. This is where domain expertise earns its keep, and where AI is most useful precisely because it does not share your assumptions. Half the constraints people hand a model are inherited defaults — "it has to integrate with our current billing system," "it must be a mobile app" — that feel like laws of physics but are decisions someone made years ago and never revisited. When I was building Kommerce for cash-on-delivery commerce in markets where card penetration is low and institutional trust is thin, the temptation was to treat "customers pay on delivery because they distrust prepayment" as an immovable constraint. It was actually a symptom of a solvable problem: the real constraint was verifiable trust at the moment of handoff, and once specified that way the design space opened up completely. A model can only help you interrogate a constraint if you have told it which constraints are load-bearing and which are habit. Label them. The ones you cannot give a reason for are usually the ones worth deleting.
The acceptance test. State, before you see the output, how you will know it is good. Not "it looks right," but a specific, checkable criterion. For a piece of analysis: which claim, if false, would sink the whole thing, and how would I verify that claim independently? This is the single highest-leverage habit in AI work, and it is borrowed wholesale from clinical medicine. A physician does not accept the first diagnosis that fits; the discipline is to name what else could produce these findings and actively hunt for the evidence that would rule the leading hypothesis out. The absence of that reflex is exactly why a confident hallucination is so dangerous: it presents the same fluency whether or not it is grounded. An acceptance test is your differential: the pre-committed set of checks that a correct answer will pass and a plausible-but-wrong answer will not. Without it, fluency is indistinguishable from correctness, and fluency is the one thing these models never lack.
The asymmetric cost of being wrong. Not all errors are equal, and a specification that treats them as equal is underspecified. If you are drafting internal brainstorming copy, a wrong answer costs you thirty seconds. If you are setting a dosing threshold, a pricing floor, or a legal representation, a wrong answer that reads perfectly can cost you the company or the patient. The cost structure should dictate how much verification the spec demands and how tightly you constrain the model's freedom to improvise. In medicine this is second nature: the pre-test probability and the cost of a miss together determine how aggressively you investigate. A test result does not update your belief in a vacuum: a positive finding for a rare condition in a low-risk patient is, at most realistic test accuracies, still more likely a false positive than a true one, and treating it as truth is how iatrogenic harm happens. The same Bayesian arithmetic governs model output. How much you trust an answer should be a function of your prior, the stakes, and the independent evidence, never of how confident the prose sounds.
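To make that arithmetic concrete, here is a minimal sketch in Python. The function name and every number in it (a 0.1% prior, 99% sensitivity, 95% specificity) are illustrative assumptions chosen for the example, not clinical figures and not measurements of any model.

```python
def posterior_given_positive(prior: float, sensitivity: float, specificity: float) -> float:
    """Probability the condition is present given a positive result (Bayes' theorem).

    prior        -- pre-test probability that the condition is present
    sensitivity  -- P(positive | condition present)
    specificity  -- P(negative | condition absent)
    """
    for name, value in (("prior", prior), ("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * (1.0 - specificity)
    total_positives = true_positives + false_positives
    if total_positives == 0.0:
        raise ValueError("a positive result is impossible under these inputs")
    return true_positives / total_positives


# Hypothetical numbers: the same test applied at two different priors.
low_risk = posterior_given_positive(prior=0.001, sensitivity=0.99, specificity=0.95)
high_risk = posterior_given_positive(prior=0.30, sensitivity=0.99, specificity=0.95)

print(f"low-risk patient:  {low_risk:.1%}")
print(f"high-risk patient: {high_risk:.1%}")
```

With these assumed inputs, the identical positive result means roughly a 2% chance of disease in the low-risk patient and roughly 89% in the high-risk one. Nothing about the test changed; only the prior did. Read the same way, a fluent answer from a model is the "positive result," and how much it should move you depends on how often that kind of answer is right and on what independent checks you have run.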
A leadership skill wearing a technical costume
None of the four parts is about phrasing. Naming the decision, separating real constraints from inherited ones, defining the acceptance test, pricing the asymmetry of error: this is the job description of anyone who sets direction for other people. It is what a good executive does when delegating to a team, what a principal investigator does when scoping a study, what a founder does when writing the one paragraph that tells everyone what "done" means. It maps so cleanly onto AI work because a capable model is, functionally, an infinitely fast, infinitely literal, deeply knowledgeable report who has no access to your intent and no incentive to ask. The manager who delegates well to humans (clear goal, explicit constraints, agreed definition of success, understood stakes) already holds the entire skill. The manager who delegates badly, dumping vague requests and hoping for the best, gets vague AI output for exactly the same reason.
This is why I distrust the framing that AI is coming to "take" the jobs of knowledge workers. That framing measures the wrong thing. AI collapses the cost of execution and phrasing — the parts of knowledge work that were always closer to labor than to judgment. What it raises, sharply, is the return on knowing what you actually want and being able to recognize when you have it. Those two capacities were always the scarce ones; they were just bundled with enough executional grunt work to hide how rare they are. Strip the grunt work away and the distribution gets brutal. People whose value was fluent execution — the passable memo, the competent analysis, the reasonable code — get compressed toward the model's baseline. People whose value was specification and judgment get amplified by the same technology, because the machine now executes their intent at a speed and quality they could never staff for.
So the anxiety is real but pointed at the wrong target. The question is not "will the model do my job." It is "is my job phrasing, or is it specifying." If it is phrasing, the depreciation has already started and no amount of prompt-craft reverses it, because your competition is the labs. If it is specifying, you have been handed the largest force multiplier of your career, and the correct response is to get radically better at the four things above — to treat every request to a model as a hypothesis you must be able to disconfirm, not a wish you hope comes true.
The people who compound value from these systems are not the ones with the cleverest prompts. They are the ones who can still tell you, precisely, what they were trying to find out, and how they would have known if the answer were wrong.