Part 3 showed how Claude Code rebuilds the entire request from scratch every turn — the system prompt, the tool schemas, and the ever-growing messages array. This article is about the mechanism that keeps re-sending all of that from getting expensive: how Claude Code uses the prompt cache to cut the cost — how it tags requests, what silently breaks the cache mid-session, and what the whole thing actually costs.
Most of the envelope is byte-for-byte identical to the previous turn. The prompt cache lets the server recognize that repeated prefix and charge a tenth of the price for it instead of recomputing it.
Part 3 framed every turn as one request with three top-level fields:
system, tools, and messages. What it left
out is the other half of the round trip. The way we find out the token usage of
each API call is the
usage
object: the Anthropic API returns it on every response, with the four token counts
wrapped inside.
Because usage comes from a response, and a response only exists for
assistant turns, only AssistantMessage nodes carry it. As the
conversation accumulates in the
messages array,
every stored assistant message keeps its own receipt; user messages — including
the tool results the harness sends back — have content but none. Hover the
receipts below for real examples.
These are the four token counts, and each one maps to a line in
Anthropic's pricing table:
(a) input_tokens at the base rate, (b) cache_creation at
the write premium, (c) cache_read at the cache discount, and
(d) output_tokens at the output rate. The first three are input-side
and always sum to the call's full input; the fourth is generation.
The two cache fields are the whole game. A cache read costs a tenth of the base input rate; a cache write costs 25% more. Write a prefix once, read it back cheaply on every later turn — and the four numbers are the scoreboard that tells you whether it's working.
Each receipt is for that one call, not a running total. The numbers don't accumulate — they reset and recompute every turn based on what was identical to last time and what was new.
Step through a real session below and see the usage of each call.
Turn 1 is the cold write: the whole ~42K prefix — system prompt, tool schemas, the first message — is fresh-written into the cache, and there's nothing to read yet. From turn 2 on it inverts. The same prefix comes back as a cache read at a tenth of the price, and the only thing written is the small delta — the previous turn's reply plus your new message. The read column climbs every turn as the conversation grows; the write column stays tiny, spiking only when a big file read fattens the delta (turn 4).
That inversion is the entire value of the cache, and the four numbers make it legible. The rest of this series is about the machinery that produces them: the cache_control marker that defines the cacheable prefix (next), the TTL and the things that quietly break the cache, and finally the cost math that falls out of it all.
The four numbers come from the server, but the server only knows what to cache because Claude Code tells it — with a marker called cache_control. Each marker says: everything from the top of the request down to here is one cacheable prefix. The API allows up to four of them per request; Claude Code effectively uses two — one covering the tools and static system prompt, one for the conversation so far.
Below is the actual request payload, in cache-prefix order — the API matches the
prefix as tools → system → messages, regardless of how the JSON fields
are ordered. Hover a marker to light up the prefix it caches; click any marker, the
boundary, or the tools row for the details.
The boundary isn't just bookkeeping — it's what makes the two markers shareable in completely different ways. Marker ① sits before the dynamic section, so its prefix — the built-in tools and the static system text — contains nothing folder-specific; it's genuinely shared across every user, machine, and directory. Marker ②'s prefix runs all the way through the dynamic section, which embeds your working directory, platform, and OS — so in practice it is scoped to one machine and one directory.
So "global" really is shared with everyone — it just only covers the part of the prompt that has nothing folder-specific in it. The big prefix that carries your conversation is the per-directory one: two sessions in the same folder share it, two in different folders (or worktrees) miss each other.
found in source. The public
cache-scope docs
only describe the default behavior — caches isolated per organization, and
per-directory in practice (marker ②). The cross-user sharing of the static prefix
is the scope: 'global' mechanism (a prompt_caching_scope
beta header) in Claude Code's own source; it isn't documented publicly, and it's
first-party only — direct API-key callers get the org-scoped default.
That globally-shared static prefix is measurable — it's the
cache_creation on the very first cold turn. A 2026 growth in
versions v2.1.91–v2.1.96 roughly doubled it.
A cache entry isn't permanent — it has a TTL. There are two, and which one you get is decided by your account, not by anything in the request.
A warm entry is worth a lot, so it's worth knowing what quietly throws it away. Below is a live cache: it starts warm. Click any event to see whether it survives — and why. Some of the answers are surprising.
Two findings stand out. First, editing CLAUDE.md mid-session does
nothing — it's read once at startup and memoized, so your change is invisible until
/clear or /compact (which recompute the prompt and bust
the cache on purpose). Second, the beta headers behind fast mode and auto mode are
sticky-on latched: once sent, Claude Code keeps sending them for
the rest of the session, so turning the feature back off won't bust the cache a
second time.
One survivor deserves a closer look, because the instinct says it should
bust: adding an MCP server. Tool schemas render first in the envelope — before
marker ① — so if a new server's tools landed in that tools array,
they'd change the bytes both markers depend on and bust the cache. That is exactly
what happens without
tool search —
Claude Code's lazy-loading scheme that sends most tools (all MCP tools among them)
by name only and fetches their full schemas on demand. With it on (the default on
Opus/Sonnet), those schemas never enter the front prefix until you actually search
for one — they only show up later, appended, if the model loads one. Toggle to see
where they land.
Now the dollars. Everything so far has been mechanism; here's the pricing it runs on. These are the Opus 4.8 rates, per million tokens, relative to the base input rate.
| operation | $ / MTok | vs input |
|---|---|---|
| input · no cache | $5.00 | 1× |
| cache read | $0.50 | 0.1× — the discount |
| 5-minute cache write | $6.25 | 1.25× |
| 1-hour cache write | $10.00 | 2× — the catch |
The trade is: pay the write premium once, then read at a tenth of the rate forever after. So when does a cache pay for itself? A 5-minute entry, which writes at 1.25×, is ahead after a single read. A 1-hour entry writes at 2×, so it needs two reads to break even — cheap insurance against a long idle gap, but not free.
Now zoom out to a whole session. Below is a 10-turn run whose context grows to 320K: the first turn writes the prefix cold, then each turn reads it back and writes only the new delta. Drag the slider to drop a cache bust on any turn — because a bust re-writes the entire prefix, the later it lands the bigger that prefix is, so the bigger the spike. The amber bracket measures that gap — the prefix it re-wrote instead of reading, times the write−read price difference. Toggle the TTL to widen the 1-hour write premium.
Notice the cached lines actually start above no-cache: turn 1 pays the write premium on the whole prefix with no prior read to amortize it, so a cold write — and every bust — costs more than not caching on that single turn. The reads on every later turn are what pull caching far below the no-cache line and keep it there.
Even with a bust mid-session, caching lands far under the no-cache line — the bust is a recoverable bump, not a reversal. The two rules of thumb fall straight out of the shape: don't change the model, effort, or tools mid-task (each is a full re-write at whatever size you've reached). And if you're stepping out for a lunch break longer than an hour, consider wrapping up that 700K-token session before you go — once it idles past your TTL the whole prefix expires, so picking it back up is a cold write. Save yourself a few bucks.