I opened my Anthropic billing page last weekend and stared at it for a minute. One Saturday of pairing with Claude Code on a small Next.js refactor had cost me $74. Not a full workday. Not a multi-repo migration. A refactor.
Scroll through r/ClaudeAI or the Cursor forum and you'll find the same story on repeat. Developers running long agent sessions are watching $50 to $100 evaporate in an afternoon. The reflex is to blame the model price. That's usually wrong. The model isn't the expensive part. The way the conversation is shaped is.
Most of what you pay for is the agent re-reading its own past.
Where the tokens actually go
Every turn, the agent ships your new message plus a stack of things you didn't type: a system prompt, tool definitions, the file tree it crawled earlier, any files it has opened, and the full back-and-forth so far. By turn 30 it's re-sending turns 1 through 29 along with whatever new question you just asked.
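A toy model makes the shape obvious. The per-message sizes below are round numbers I made up, but the conclusion doesn't depend on them:

```ts
// Toy model of an agent loop that replays the full transcript on every request.
// The per-message sizes are made-up round numbers, not measurements.
const userMsg = 500;    // tokens in each new user message
const reply = 1_000;    // tokens in each assistant reply

let prefix = 0;         // tokens already sitting in the transcript
let inputBilled = 0;    // total input tokens submitted across the session

for (let turn = 1; turn <= 30; turn++) {
  inputBilled += prefix + userMsg; // the whole history gets re-sent, plus the new message
  prefix += userMsg + reply;       // both sides are appended and re-read on every later turn
}

console.log({ inputBilled, actuallyTyped: userMsg * 30 });
// { inputBilled: 667500, actuallyTyped: 15000 } -- roughly 98% of the input is re-reads
```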
Anthropic reports a usage field called cache_read_input_tokens on every API response. If you log it next to input_tokens, the ratio is brutal.
The ratios I'm describing come from logging a week of my own Claude Code sessions, plus a thread from Simon Willison where he ran the same analysis. Your mileage will vary, but the shape is consistent across everyone who has bothered to instrument it.
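If you're hitting the API directly rather than going through Claude Code, a sketch like this is enough to start logging your own ratio (the model ID is a placeholder):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

async function main() {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder; use whatever model you're instrumenting
    max_tokens: 512,
    messages: [{ role: "user", content: "Summarize the last diff." }],
  });

  const u = msg.usage;
  const cachedReads = u.cache_read_input_tokens ?? 0;
  const freshInput = u.input_tokens + (u.cache_creation_input_tokens ?? 0);

  // The number to watch: how much of each request is re-read history vs. new content.
  console.log({
    cachedReads,
    freshInput,
    outputTokens: u.output_tokens,
    cacheHitShare: cachedReads / (cachedReads + freshInput),
  });
}

main().catch(console.error);
```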
Why does this happen at the architecture level? Transformers don't have memory. They have context. Every forward pass starts from scratch and reads the entire prompt to produce one token. The "conversation" is an illusion the API builds for you by replaying the transcript on every call. The model isn't remembering. It's re-reading.
And the attention that does the re-reading is roughly quadratic in sequence length. Doubling your context doesn't double the compute; it roughly quadruples it. That's the bill.
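A back-of-envelope version of that claim, ignoring the linear terms:

```ts
// Rough model: self-attention does work proportional to n * n for a prompt of n tokens.
// Real serving stacks soften this, so treat it as a sketch, not an invoice.
const attnCost = (n: number) => n * n;

console.log(attnCost(16_000) / attnCost(8_000)); // 4: double the context, roughly four times the attention work
```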
Match the model to the task
Using Opus or GPT-5.5 Pro for everything is the easiest mistake to make because the smart model feels safer. It's also a 15x to 30x cost multiplier on tasks where a smaller model would produce an identical answer.
A grep across a repo doesn't need a frontier reasoning model. Neither does renaming a variable, fixing a TypeScript error, or writing a basic test. Those are tasks where Haiku or GPT-5-mini will produce the same diff as Opus, just cheaper and faster.
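If you drive the agent from your own harness, even a dumb routing table captures most of the win. The task buckets and model IDs here are illustrative, not a recommendation, and the routing happens per fresh session, not mid-session (more on that in a moment):

```ts
// Hypothetical task-to-model routing. The buckets and model IDs are placeholders.
type Task = "grep" | "rename" | "fix-type-error" | "write-basic-test" | "design-review" | "gnarly-bug";

const MECHANICAL: ReadonlySet<Task> = new Set<Task>([
  "grep",
  "rename",
  "fix-type-error",
  "write-basic-test",
]);

function pickModel(task: Task): string {
  // Mechanical edits: a small model usually produces the same diff, cheaper and faster.
  // Reasoning-heavy work: pay for the frontier model where it actually changes the answer.
  return MECHANICAL.has(task) ? "claude-haiku-4-5" : "claude-opus-4-5";
}

console.log(pickModel("rename"));     // claude-haiku-4-5
console.log(pickModel("gnarly-bug")); // claude-opus-4-5
```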
Output tokens are where the model choice really stings. Decoding is sequential: the model produces one token, feeds it back into itself, produces the next. It can't be parallelized across the sequence. So output tokens cost 4x to 5x more than input tokens on most providers, and that ratio gets worse on larger models because each forward pass is heavier.
Don't switch models mid-session. The prefix cache is keyed to the model. Drop from Opus to Haiku halfway through and Haiku has to cold-read your entire history at full price. You'll often pay more than if you'd just finished on Opus.
Give the agent a map, not a search task
There's a pattern Andrej Karpathy has written about a few times that he calls the LLM wiki: a small, hand-curated file that tells the model where things live before it starts looking.
The reason this works isn't motivational. It's mechanical. When you ask an agent "find the bug in the login flow," it doesn't know where login lives, so it runs ls, then grep, then opens three files that turn out to be wrong, then opens two more. Every one of those tool calls returns content that gets pinned into context for the rest of the session. You've now paid for 15 tool roundtrips and stuffed maybe 8,000 tokens of mostly-irrelevant file content into the prefix that every future turn will drag along.
A CLAUDE.md that says "auth lives in app/api/auth/, the session check is in lib/session.ts:checkSession" replaces all of that with about 40 tokens. The agent stops exploring and starts working.
Vague questions are expensive twice. They burn tool calls now, and they pollute the prefix that every later turn has to re-read.
Prompt caching, and why it's so easy to break
This is the biggest single lever. Anthropic, OpenAI, and Google all support some form of prefix caching now. On Anthropic, writing a prefix into the cache costs a 25% premium over the normal input price, and reading it back on later requests costs 10% of the normal input price.
Mechanically, what's cached is the attention state for your stable prefix: the key and value tensors. Computing those tensors is the expensive part of a forward pass. When the provider sees the same prefix bytes on a later request, it skips that computation and reads the precomputed tensors from a memory cache. You pay 10% of the input price instead of 100% for that prefix.
The catch is that the cache is keyed by an exact byte match on the prefix. One byte of drift and the cache miss cascades through everything after it. A surprising number of system prompts inject the current timestamp or a session ID at the top of the message. That single dynamic field invalidates the entire cache, every turn.
Here's the kind of thing that quietly costs you full price on every turn:
```yaml
system: |
  You are a coding assistant. Current time: {{timestamp}}.
  Project conventions: [2000 tokens of rules]
```

Drop the timestamp out of the system prompt and move it into the user message instead. The 2,000 tokens of rules then sit in a stable prefix and hit the cache from turn two onward. Dynamic stuff at the tail of the prompt is fine. Dynamic stuff at the head of the prompt is a tax.
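The cache-friendly version is the same content in a different order: stable rules first, volatile details in the user turn. Here's a sketch using the Anthropic SDK's cache_control breakpoint (the model ID is a placeholder):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Stable prefix: identical bytes on every request, so it caches after the first turn.
const PROJECT_RULES = "You are a coding assistant.\n\n[2000 tokens of project conventions]";

async function ask(question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // placeholder
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: PROJECT_RULES,
        cache_control: { type: "ephemeral" }, // cache everything up to and including this block
      },
    ],
    messages: [
      {
        role: "user",
        // Anything dynamic (timestamps, session IDs) goes after the cached prefix, never before it.
        content: `Current time: ${new Date().toISOString()}\n\n${question}`,
      },
    ],
  });
}
```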
Tell the model to shut up
Models are trained on human conversation, and humans pad. They restate the question. They explain what they're about to do. They wrap up with "let me know if you need anything else." All of that is decode tokens, which is the expensive direction.
Telling the model "skip preambles and recaps, output the diff only" reliably cuts output volume by 30 to 50% on conversational turns. I logged this on my own usage for a week. The accuracy on actual code output didn't change. The wrapper around it shrank a lot.
This works because decode is sequential: the model can't parallelize across the output the way it can across the input. Every token saved on output is a forward pass you didn't run. Compare that to input, where the prefix gets processed in one big parallel chunk and then cached. Per token, trimming output saves you more than trimming input does.
The follow-up correction trap
The single most expensive thing you can do in a long session is reply with "no, do it this way instead." It feels free. It's not.
What that follow-up does is append your correction and the original wrong attempt into history. From this turn forward, every request re-sends both. If you do this five times in a session, you've now baked five wrong attempts into the prefix that the model re-reads on every single subsequent turn. With quadratic attention scaling, the cost of each new turn keeps climbing.
A related move is batching. If you know you need three small changes in the same area, ask for all three in one prompt instead of three sequential ones. The model loads the relevant files once instead of three times, and you avoid two full prefix re-reads. I've seen this drop the cost of a multi-step task by roughly 3x in my own logs.
Keep CLAUDE.md lean
This is the easiest one to get wrong because it feels like more context equals better answers. It doesn't, and it has a fixed cost.
CLAUDE.md (or AGENTS.md, or whatever your tool calls it) is injected into the system prompt on every single turn. A 4,000-token CLAUDE.md isn't a one-time cost. It's a 4,000-token tax on message one, message two, and message two hundred. Even with caching it's still 10% of full price, every turn, for the life of the session.
Anthropic's own guidance is to keep CLAUDE.md under about 200 lines. Treat it like a phonebook, not a manual. Build commands, file layout, naming conventions, the three things that aren't obvious from a quick ls. Not your whole style guide.
If you want detailed conventions documented somewhere, put them in normal docs and let the agent read those files on demand. That way you pay for them in the sessions where they matter and not in the sessions where they don't.
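For shape, a lean CLAUDE.md can look something like this. The two auth paths are the ones from earlier; everything else is hypothetical filler for a Next.js repo:

```markdown
# CLAUDE.md

## Commands (hypothetical examples)
- Dev server: npm run dev
- Tests: npm test
- Typecheck: npm run typecheck

## Layout
- Auth routes: app/api/auth/
- Session check: lib/session.ts (checkSession)

## Conventions that aren't obvious from ls
- Server components by default; add "use client" only where needed.
- Full style guide lives in docs/conventions.md; read it only when a task touches style.
```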
What actually moves the number
I ran a comparison on the same task two weekends in a row. Same feature, same starting commit, similar scope. First weekend I worked the way I had been working. Second weekend I applied four changes: a leaner CLAUDE.md, a stable system prefix for caching, a terse-output instruction, and clearing the session between tasks instead of letting it accumulate.
First weekend: $74. Second weekend: $11. The output was, if anything, slightly better, mostly because the shorter sessions kept the model from getting confused by its own earlier mistakes.
The interesting part isn't that any one of those changes is magic. None of them are. Caching alone saves you maybe 30%. Terse output alone saves you maybe 15%. The reason the combined effect is closer to 80% is that these costs multiply rather than add. A short prompt that hits the cache and produces short output is paying a small fraction of the price on three different axes at once.
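The arithmetic is easy to sanity-check. The first two fractions below are the per-lever estimates above; the last two are guesses I'm making up for illustration, not measurements:

```ts
// Each change leaves a fraction of the cost in place; independent fractions multiply.
// The first two fractions are the rough estimates above; the last two are guesses.
const remaining = [
  0.70, // prompt caching (~30% saved)
  0.85, // terse output (~15% saved)
  0.55, // leaner CLAUDE.md (guess)
  0.45, // clearing the session between tasks (guess)
].reduce((acc, f) => acc * f, 1);

console.log(remaining.toFixed(2)); // 0.15, i.e. roughly 85% off, the same ballpark as $74 dropping to $11
```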
A note on what's still open
The honest answer is that none of this is stable. Prompt caching pricing has changed twice this year. Anthropic shipped extended caching with a 1-hour TTL in late 2025, which changes the math on long sessions. New models keep moving where the cost-quality sweet spot sits. Codex and Claude Code both ship updates that quietly change how they construct context.
So the playbook above is what's working in May 2026. The underlying principles (attention is quadratic, decode is sequential, prefixes get cached) aren't going anywhere. The specific knobs probably will. Worth re-checking your billing dashboard every couple of months and seeing what your token mix actually looks like, rather than trusting that whatever you set up last quarter is still optimal.
If you've only got time to do one thing this week, log your cache hit rate. Everything else flows from knowing that number.