deep dive · claude code · part 4 — prompt cache

Part 3 showed how Claude Code rebuilds the entire request from scratch every turn — the system prompt, the tool schemas, and the ever-growing messages array. This article is about the mechanism that keeps re-sending all of that from getting expensive: how Claude Code uses the prompt cache to cut the cost — how it tags requests, what silently breaks the cache mid-session, and what the whole thing actually costs.

Most of the envelope is byte-for-byte identical to the previous turn. The prompt cache lets the server recognize that repeated prefix and charge a tenth of the price for it instead of recomputing it.

where the receipt comes from

Part 3 framed every turn as one request with three top-level fields: system, tools, and messages. What it left out is the other half of the round trip. The way we find out the token usage of each API call is the usage object: the Anthropic API returns it on every response, with the four token counts wrapped inside.

request · what you send

systembehavioral prompt

toolstool schemas

messagesconversation so far

▶

response · what anthropic returns

roleassistant▸

contenttext + tool_use▸

usagein · write · read · out ← the receipt▸

click a field in the response to see an example

the request grows every turn; the response always carries exactly one usage receipt for the call it just answered

only the assistant carries a receipt

Because usage comes from a response, and a response only exists for assistant turns, only AssistantMessage nodes carry it. As the conversation accumulates in the messages array, every stored assistant message keeps its own receipt; user messages — including the tool results the harness sends back — have content but none. Hover the receipts below for real examples.

user

"what's in package.json?"no usage

assistant

text + tool_useusage ▸

user

tool_result · file contentsno usage

assistant

"it depends on react and…"usage ▸

usage rides on assistant messages only · the two examples already show the cold-write → warm-read flip

the four numbers

These are the four token counts, and each one maps to a line in Anthropic's pricing table: (a) input_tokens at the base rate, (b) cache_creation at the write premium, (c) cache_read at the cache discount, and (d) output_tokens at the output rate. The first three are input-side and always sum to the call's full input; the fourth is generation.

input_tokens1×base rate

Processed fresh — no cache hit. In Claude Code this is usually small: almost the entire prefix sits under a cache marker, so the only "fresh" tokens are whatever falls outside a cached, unbroken prefix.

cache_creation_input_tokens+25%write premium

Written into the cache this call — the slice of the prefix the server hadn't seen before. You pay 25% over base once (2× for a 1-hour entry) so that later turns can read those same tokens cheaply.

cache_read_input_tokens−90%big discount

Served straight from the cache — the identical prefix the server already had stored, at just 10% of the base rate. In a warm session this is the largest of the four by far, and the reason long conversations stay affordable.

output_tokens1×output rate

Generated by the model. Billed at the output rate and never cached — generation is always fresh. Independent of the input window: a full input doesn't shrink how much you can generate.

input + cache_write + cache_read = the call's full input · output is billed separately

The two cache fields are the whole game. A cache read costs a tenth of the base input rate; a cache write costs 25% more. Write a prefix once, read it back cheaply on every later turn — and the four numbers are the scoreboard that tells you whether it's working.

each receipt is independent

Each receipt is for that one call, not a running total. The numbers don't accumulate — they reset and recompute every turn based on what was identical to last time and what was new.

turn 1 / 6

response.usage

input_tokens 0

cache_creation 0

cache_read 0

output_tokens 0

call cost · hover for math $0.000

input side · fresh / write / read

this turn · what entered the conversation

the ledger — one receipt per turn · they don't sum, each is its own call

input (fresh, 1×) cache write (+25%) cache read (−90%) output (1×) call cost · opus 4.8 · base $5 / 5-min write $6.25 / read $0.50 / output $25 per MTok · hover any $ for the breakdown

turn 1

Turn 1 is the cold write: the whole ~42K prefix — system prompt, tool schemas, the first message — is fresh-written into the cache, and there's nothing to read yet. From turn 2 on it inverts. The same prefix comes back as a cache read at a tenth of the price, and the only thing written is the small delta — the previous turn's reply plus your new message. The read column climbs every turn as the conversation grows; the write column stays tiny, spiking only when a big file read fattens the delta (turn 4).

That inversion is the entire value of the cache, and the four numbers make it legible. The rest of this series is about the machinery that produces them: the cache_control marker that defines the cacheable prefix (next), the TTL and the things that quietly break the cache, and finally the cost math that falls out of it all.

cache markers

The four numbers come from the server, but the server only knows what to cache because Claude Code tells it — with a marker called cache_control. Each marker says: everything from the top of the request down to here is one cacheable prefix. The API allows up to four of them per request; Claude Code effectively uses two — one covering the tools and static system prompt, one for the conversation so far.

Below is the actual request payload, in cache-prefix order — the API matches the prefix as tools → system → messages, regardless of how the JSON fields are ordered. Hover a marker to light up the prefix it caches; click any marker, the boundary, or the tools row for the details.

one request · two cache_control markers · hover to light the cached prefix

toolsrenders first in the cache prefix

tool schemas · Read, Bash, WebSearch, …built-in, identical per version · no marker of its own ▸

systemarray of text blocks

static · intro & identity"You are Claude Code…"

static · behavioral prompt# Doing tasks · # Using your tools · # Tone

◆ ①cache_control · scope: 'global'caches tools + static system ↑▸

— SYSTEM_PROMPT_DYNAMIC_BOUNDARY —

dynamic · env · auto-memory · MCP instructions · gitper-session, after the boundary

messagesconversation, grows

msg[0] · synthetic CLAUDE.md + date

conversation · user / assistant / tool blocks

◆ ②cache_control · ephemeralcaches everything ↑ (org-scoped)▸

hover a marker to light its cached prefix · click for the cache_control shape and scope

cache_control is always type: 'ephemeral' — Anthropic's name for a prompt-cache breakpoint (it expires after a TTL, covered next). marker ② carries no scope field, so it defaults to org scope.

The boundary isn't just bookkeeping — it's what makes the two markers shareable in completely different ways. Marker ① sits before the dynamic section, so its prefix — the built-in tools and the static system text — contains nothing folder-specific; it's genuinely shared across every user, machine, and directory. Marker ②'s prefix runs all the way through the dynamic section, which embeds your working directory, platform, and OS — so in practice it is scoped to one machine and one directory.

marker ① · global

tools + static system · no cwd

shared across every user, machine & folder
same Claude Code version + tool set
buys a warm start on the ~38–119K prefix

marker ② · org

whole request · embeds cwd

effectively one machine + one directory
a different folder or git worktree → cache miss
carries your actual conversation, every turn

So "global" really is shared with everyone — it just only covers the part of the prompt that has nothing folder-specific in it. The big prefix that carries your conversation is the per-directory one: two sessions in the same folder share it, two in different folders (or worktrees) miss each other.

found in source. The public cache-scope docs only describe the default behavior — caches isolated per organization, and per-directory in practice (marker ②). The cross-user sharing of the static prefix is the scope: 'global' mechanism (a prompt_caching_scope beta header) in Claude Code's own source; it isn't documented publicly, and it's first-party only — direct API-key callers get the org-scoped default.

how big is the shared block

That globally-shared static prefix is measurable — it's the cache_creation on the very first cold turn. A 2026 growth in versions v2.1.91–v2.1.96 roughly doubled it.

how long does it last

A cache entry isn't permanent — it has a TTL. There are two, and which one you get is decided by your account, not by anything in the request.

1 hour

extended · requested automatically

Claude subscription — Pro / Max / Enterprise, within plan usage
Anthropic internal — USER_TYPE = ant
any auth — opt in with ENABLE_PROMPT_CACHING_1H=1

5 minutes

default · per-token billing

subscription, once over the limit — on usage credits
API key · Bedrock · Vertex · Foundry — by default
force anywhere — FORCE_PROMPT_CACHING_5M=1

⚲ latched at session start. Eligibility is decided once, when the session opens — it never flips mid-session, even if your usage or plan state changes while you work. Each cache hit resets the TTL clock, so the entry stays warm as long as you keep working.

what breaks the cache

A warm entry is worth a lot, so it's worth knowing what quietly throws it away. Below is a live cache: it starts warm. Click any event to see whether it survives — and why. Some of the answers are surprising.

cache · WARM — prefix served at −90%

the cached prefix from last turn is intact · pick an event below

client-side · Claude Code changes the request, so the key changes

outside the client · nothing in the request changed, but the entry is gone

keeps the cache · appended to the end, or ignored until restart

click an event above

busting means the next call re-writes the whole prefix at +25% before it can be read cheaply again

action-by-action reference: Claude Code docs · how Claude Code uses prompt caching

Two findings stand out. First, editing CLAUDE.md mid-session does nothing — it's read once at startup and memoized, so your change is invisible until /clear or /compact (which recompute the prompt and bust the cache on purpose). Second, the beta headers behind fast mode and auto mode are sticky-on latched: once sent, Claude Code keeps sending them for the rest of the session, so turning the feature back off won't bust the cache a second time.

where an MCP server lands

One survivor deserves a closer look, because the instinct says it should bust: adding an MCP server. Tool schemas render first in the envelope — before marker ① — so if a new server's tools landed in that tools array, they'd change the bytes both markers depend on and bust the cache. That is exactly what happens without tool search — Claude Code's lazy-loading scheme that sends most tools (all MCP tools among them) by name only and fetches their full schemas on demand. With it on (the default on Opus/Sonnet), those schemas never enter the front prefix until you actually search for one — they only show up later, appended, if the model loads one. Toggle to see where they land.

add an MCP server · where its tools land in the envelope

tools · built-ins + ToolSearch + all MCP schemas

◆ ① global — caches tools + static system ↑

systemstatic · dynamic (env · memory · mcp instructions · git)

messagesmsg[0] · conversation

+ MCP tool · appended as a tool_reference when the model searches for it

◆ ② org — caches everything ↑

what it costs

Now the dollars. Everything so far has been mechanism; here's the pricing it runs on. These are the Opus 4.8 rates, per million tokens, relative to the base input rate.

operation	$ / MTok	vs input
input · no cache	$5.00	1×
cache read	$0.50	0.1× — the discount
5-minute cache write	$6.25	1.25×
1-hour cache write	$10.00	2× — the catch

the 1-hour write costs double the base rate · note: Claude Code's own cost display only knows the 5-minute write price, so it understates spend for 1-hour-eligible users

The trade is: pay the write premium once, then read at a tenth of the rate forever after. So when does a cache pay for itself? A 5-minute entry, which writes at 1.25×, is ahead after a single read. A 1-hour entry writes at 2×, so it needs two reads to break even — cheap insurance against a long idle gap, but not free.

over a whole session

Now zoom out to a whole session. Below is a 10-turn run whose context grows to 320K: the first turn writes the prefix cold, then each turn reads it back and writes only the new delta. Drag the slider to drop a cache bust on any turn — because a bust re-writes the entire prefix, the later it lands the bigger that prefix is, so the bigger the spike. The amber bracket measures that gap — the prefix it re-wrote instead of reading, times the write−read price difference. Toggle the TTL to widen the 1-hour write premium.

Notice the cached lines actually start above no-cache: turn 1 pays the write premium on the whole prefix with no prior read to amortize it, so a cold write — and every bust — costs more than not caching on that single turn. The reads on every later turn are what pull caching far below the no-cache line and keep it there.

Even with a bust mid-session, caching lands far under the no-cache line — the bust is a recoverable bump, not a reversal. The two rules of thumb fall straight out of the shape: don't change the model, effort, or tools mid-task (each is a full re-write at whatever size you've reached). And if you're stepping out for a lunch break longer than an hour, consider wrapping up that 700K-token session before you go — once it idles past your TTL the whole prefix expires, so picking it back up is a cold write. Save yourself a few bucks.