deep dive · claude code · part 4 — prompt cache

2026-06-06 · 12 min read · #claude-code #source-dive #prompt-cache

Part 3 showed how Claude Code rebuilds the entire request from scratch every turn — the system prompt, the tool schemas, and the ever-growing messages array. This article is about the mechanism that keeps re-sending all of that from getting expensive: how Claude Code uses the prompt cache to cut the cost — how it tags requests, what silently breaks the cache mid-session, and what the whole thing actually costs.

Most of the envelope is byte-for-byte identical to the previous turn. The prompt cache lets the server recognize that repeated prefix and charge a tenth of the price for it instead of recomputing it.

where the receipt comes from

Part 3 framed every turn as one request with three top-level fields: system, tools, and messages. What it left out is the other half of the round trip. The way we find out the token usage of each API call is the usage object: the Anthropic API returns it on every response, with the four token counts wrapped inside.

request · what you send
systembehavioral prompt
toolstool schemas
messagesconversation so far
response · what anthropic returns
roleassistant
contenttext + tool_use
usagein · write · read · out ← the receipt
click a field in the response to see an example
the request grows every turn; the response always carries exactly one usage receipt for the call it just answered

only the assistant carries a receipt

Because usage comes from a response, and a response only exists for assistant turns, only AssistantMessage nodes carry it. As the conversation accumulates in the messages array, every stored assistant message keeps its own receipt; user messages — including the tool results the harness sends back — have content but none. Hover the receipts below for real examples.

user
"what's in package.json?"no usage
assistant
text + tool_useusage ▸
user
tool_result · file contentsno usage
assistant
"it depends on react and…"usage ▸
usage rides on assistant messages only · the two examples already show the cold-write → warm-read flip

the four numbers

These are the four token counts, and each one maps to a line in Anthropic's pricing table: (a) input_tokens at the base rate, (b) cache_creation at the write premium, (c) cache_read at the cache discount, and (d) output_tokens at the output rate. The first three are input-side and always sum to the call's full input; the fourth is generation.

input_tokensbase rate
Processed fresh — no cache hit. In Claude Code this is usually small: almost the entire prefix sits under a cache marker, so the only "fresh" tokens are whatever falls outside a cached, unbroken prefix.
cache_creation_input_tokens+25%write premium
Written into the cache this call — the slice of the prefix the server hadn't seen before. You pay 25% over base once (2× for a 1-hour entry) so that later turns can read those same tokens cheaply.
cache_read_input_tokens−90%big discount
Served straight from the cache — the identical prefix the server already had stored, at just 10% of the base rate. In a warm session this is the largest of the four by far, and the reason long conversations stay affordable.
output_tokensoutput rate
Generated by the model. Billed at the output rate and never cached — generation is always fresh. Independent of the input window: a full input doesn't shrink how much you can generate.
input + cache_write + cache_read = the call's full input · output is billed separately

The two cache fields are the whole game. A cache read costs a tenth of the base input rate; a cache write costs 25% more. Write a prefix once, read it back cheaply on every later turn — and the four numbers are the scoreboard that tells you whether it's working.

each receipt is independent

Each receipt is for that one call, not a running total. The numbers don't accumulate — they reset and recompute every turn based on what was identical to last time and what was new.

Step through a real session below and see the usage of each call.

turn 1 / 6
response.usage
input_tokens 0
cache_creation 0
cache_read 0
output_tokens 0
call cost · hover for math $0.000
input side · fresh / write / read
this turn · what entered the conversation
the ledger — one receipt per turn · they don't sum, each is its own call
input (fresh, 1×) cache write (+25%) cache read (−90%) output (1×) call cost · opus 4.8 · base $5 / 5-min write $6.25 / read $0.50 / output $25 per MTok · hover any $ for the breakdown
turn 1

Turn 1 is the cold write: the whole ~42K prefix — system prompt, tool schemas, the first message — is fresh-written into the cache, and there's nothing to read yet. From turn 2 on it inverts. The same prefix comes back as a cache read at a tenth of the price, and the only thing written is the small delta — the previous turn's reply plus your new message. The read column climbs every turn as the conversation grows; the write column stays tiny, spiking only when a big file read fattens the delta (turn 4).

That inversion is the entire value of the cache, and the four numbers make it legible. The rest of this series is about the machinery that produces them: the cache_control marker that defines the cacheable prefix (next), the TTL and the things that quietly break the cache, and finally the cost math that falls out of it all.

cache markers

The four numbers come from the server, but the server only knows what to cache because Claude Code tells it — with a marker called cache_control. Each marker says: everything from the top of the request down to here is one cacheable prefix. The API allows up to four of them per request; Claude Code effectively uses two — one covering the tools and static system prompt, one for the conversation so far.

Below is the actual request payload, in cache-prefix order — the API matches the prefix as tools → system → messages, regardless of how the JSON fields are ordered. Hover a marker to light up the prefix it caches; click any marker, the boundary, or the tools row for the details.

one request · two cache_control markers · hover to light the cached prefix
toolsrenders first in the cache prefix
tool schemas · Read, Bash, WebSearch, …built-in, identical per version · no marker of its own ▸
systemarray of text blocks
static · intro & identity"You are Claude Code…"
static · behavioral prompt# Doing tasks · # Using your tools · # Tone
◆ ①cache_control · scope: 'global'caches tools + static system ↑
— SYSTEM_PROMPT_DYNAMIC_BOUNDARY —
dynamic · env · auto-memory · MCP instructions · gitper-session, after the boundary
messagesconversation, grows
msg[0] · synthetic CLAUDE.md + date
conversation · user / assistant / tool blocks
◆ ②cache_control · ephemeralcaches everything ↑ (org-scoped)
hover a marker to light its cached prefix · click for the cache_control shape and scope
cache_control is always type: 'ephemeral' — Anthropic's name for a prompt-cache breakpoint (it expires after a TTL, covered next). marker ② carries no scope field, so it defaults to org scope.

The boundary isn't just bookkeeping — it's what makes the two markers shareable in completely different ways. Marker ① sits before the dynamic section, so its prefix — the built-in tools and the static system text — contains nothing folder-specific; it's genuinely shared across every user, machine, and directory. Marker ②'s prefix runs all the way through the dynamic section, which embeds your working directory, platform, and OS — so in practice it is scoped to one machine and one directory.

marker ① · global
tools + static system · no cwd
marker ② · org
whole request · embeds cwd

So "global" really is shared with everyone — it just only covers the part of the prompt that has nothing folder-specific in it. The big prefix that carries your conversation is the per-directory one: two sessions in the same folder share it, two in different folders (or worktrees) miss each other.

found in source. The public cache-scope docs only describe the default behavior — caches isolated per organization, and per-directory in practice (marker ②). The cross-user sharing of the static prefix is the scope: 'global' mechanism (a prompt_caching_scope beta header) in Claude Code's own source; it isn't documented publicly, and it's first-party only — direct API-key callers get the org-scoped default.

how big is the shared block

That globally-shared static prefix is measurable — it's the cache_creation on the very first cold turn. A 2026 growth in versions v2.1.91–v2.1.96 roughly doubled it.

static system prompt · tokens saved on every warm turn (measured from the cold-cache write) ~38–52K baseline ~112–119K · after v2.1.91–96 spike 0 50K 100K 125K
the bigger the static block, the more every warm turn saves — and the more a single cache miss costs (issue #45188)

how long does it last

A cache entry isn't permanent — it has a TTL. There are two, and which one you get is decided by your account, not by anything in the request.

1 hour
extended · requested automatically
5 minutes
default · per-token billing
⚲ latched at session start. Eligibility is decided once, when the session opens — it never flips mid-session, even if your usage or plan state changes while you work. Each cache hit resets the TTL clock, so the entry stays warm as long as you keep working.

what breaks the cache

A warm entry is worth a lot, so it's worth knowing what quietly throws it away. Below is a live cache: it starts warm. Click any event to see whether it survives — and why. Some of the answers are surprising.

cache · WARM — prefix served at −90%
the cached prefix from last turn is intact · pick an event below
client-side · Claude Code changes the request, so the key changes
outside the client · nothing in the request changed, but the entry is gone
keeps the cache · appended to the end, or ignored until restart
click an event above
busting means the next call re-writes the whole prefix at +25% before it can be read cheaply again
action-by-action reference: Claude Code docs · how Claude Code uses prompt caching

Two findings stand out. First, editing CLAUDE.md mid-session does nothing — it's read once at startup and memoized, so your change is invisible until /clear or /compact (which recompute the prompt and bust the cache on purpose). Second, the beta headers behind fast mode and auto mode are sticky-on latched: once sent, Claude Code keeps sending them for the rest of the session, so turning the feature back off won't bust the cache a second time.

where an MCP server lands

One survivor deserves a closer look, because the instinct says it should bust: adding an MCP server. Tool schemas render first in the envelope — before marker ① — so if a new server's tools landed in that tools array, they'd change the bytes both markers depend on and bust the cache. That is exactly what happens without tool search — Claude Code's lazy-loading scheme that sends most tools (all MCP tools among them) by name only and fetches their full schemas on demand. With it on (the default on Opus/Sonnet), those schemas never enter the front prefix until you actually search for one — they only show up later, appended, if the model loads one. Toggle to see where they land.

add an MCP server · where its tools land in the envelope
tools · built-ins + ToolSearch + all MCP schemas
◆ ① global — caches tools + static system ↑
systemstatic · dynamic (env · memory · mcp instructions · git)
messagesmsg[0] · conversation
+ MCP tool · appended as a tool_reference when the model searches for it
◆ ② org — caches everything ↑

what it costs

Now the dollars. Everything so far has been mechanism; here's the pricing it runs on. These are the Opus 4.8 rates, per million tokens, relative to the base input rate.

operation$ / MTokvs input
input · no cache$5.00
cache read$0.500.1× — the discount
5-minute cache write$6.251.25×
1-hour cache write$10.002× — the catch
the 1-hour write costs double the base rate · note: Claude Code's own cost display only knows the 5-minute write price, so it understates spend for 1-hour-eligible users

The trade is: pay the write premium once, then read at a tenth of the rate forever after. So when does a cache pay for itself? A 5-minute entry, which writes at 1.25×, is ahead after a single read. A 1-hour entry writes at 2×, so it needs two reads to break even — cheap insurance against a long idle gap, but not free.

over a whole session

Now zoom out to a whole session. Below is a 10-turn run whose context grows to 320K: the first turn writes the prefix cold, then each turn reads it back and writes only the new delta. Drag the slider to drop a cache bust on any turn — because a bust re-writes the entire prefix, the later it lands the bigger that prefix is, so the bigger the spike. The amber bracket measures that gap — the prefix it re-wrote instead of reading, times the write−read price difference. Toggle the TTL to widen the 1-hour write premium.

Notice the cached lines actually start above no-cache: turn 1 pays the write premium on the whole prefix with no prior read to amortize it, so a cold write — and every bust — costs more than not caching on that single turn. The reads on every later turn are what pull caching far below the no-cache line and keep it there.

warm cache$0.00 with one bust$0.00 no cache$0.00
TTL
bust at turn 5
warm cache (cold write t1, then reads) one bust mid-session no cache (full input every turn)

Even with a bust mid-session, caching lands far under the no-cache line — the bust is a recoverable bump, not a reversal. The two rules of thumb fall straight out of the shape: don't change the model, effort, or tools mid-task (each is a full re-write at whatever size you've reached). And if you're stepping out for a lunch break longer than an hour, consider wrapping up that 700K-token session before you go — once it idles past your TTL the whole prefix expires, so picking it back up is a cold write. Save yourself a few bucks.

references