deep dive · claude code · part 3 — context

2026-04-23 · 10 min read · #claude-code #source-dive #agents

Every time you press enter in Claude Code, the harness assembles a complete document and ships it to the Anthropic API. The document is rebuilt from scratch each turn — but most of it is identical to the previous turn. Understanding what goes in it, and how it grows over the course of a session, is the foundation for the next two articles: why the prompt cache matters (Part 4), and why compaction is necessary (Part 5).

the context window question

"200K context window" is a shorthand that's slightly misleading. The number varies by model — 200K is the default, and 1M is available for Claude Sonnet 4.6 and Opus 4.6. More precisely, the context window is the input limit. Output is a separate budget: up to 64K tokens for most models, 128K for Opus. The two are independent — a full 200K input doesn't reduce how many output tokens you can receive.
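The independence of the two budgets can be sketched as a toy check (constants from the paragraph above; the function is illustrative, not the API's actual validation):

```python
# illustrative budget check: input window and output budget are independent
CONTEXT_WINDOW = 200_000  # input limit (1M on the long-context models)
MAX_OUTPUT = 64_000       # separate output budget (128K for Opus)

def request_fits(input_tokens: int, max_tokens: int) -> bool:
    # checked separately: a nearly-full input window does not
    # shrink how many output tokens the request may produce
    return input_tokens <= CONTEXT_WINDOW and max_tokens <= MAX_OUTPUT

print(request_fits(199_000, 64_000))  # True: full input, full output
print(request_fits(210_000, 1_000))   # False: input over the window
```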

The effective input budget is also smaller than the advertised window. Claude Code triggers compaction before the conversation fills the input window — it reserves 20K tokens of headroom for the compaction summary output, plus a 13K safety buffer. The autocompact threshold is:

[figure: how the harness slices the context window · same 33K reserve, very different proportions. 200K model: 167K usable (83.5% of window), 33K reserved (16.5%). 1M model, same absolute scale: 967K usable (96.7% of window), the same 33K reserve is only 3.3%. reserve = 13K safety buffer + 20K compaction output reserve, fixed by the harness; its proportional cost shrinks as the window grows.]
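The arithmetic is simple enough to write down directly (constants from the article; the function itself is a sketch, not the harness's actual code):

```python
# autocompact threshold sketch: window minus the fixed 33K reserve
COMPACTION_RESERVE = 20_000  # headroom for the compaction summary output
SAFETY_BUFFER = 13_000       # additional fixed safety margin

def autocompact_threshold(window: int) -> int:
    return window - COMPACTION_RESERVE - SAFETY_BUFFER

print(autocompact_threshold(200_000))    # 167000 (83.5% of the window)
print(autocompact_threshold(1_000_000))  # 967000 (96.7% of the window)
```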

what fills the envelope

An Anthropic API request has three top-level fields: system, tools, and messages. All three count against the input window. Click any segment to see what's actually inside.

an api request · three top-level fields, and what each one actually contains:

system
- static · intro & identity
- static · # System (text output, permissions, hooks)
- static · # Doing tasks (engineering guidance)
- static · # Executing actions with care
- static · # Using your tools
- static · # Tone & style + # Output efficiency
- — SYSTEM_PROMPT_DYNAMIC_BOUNDARY —
- dynamic · # Environment (cwd, platform, model, cutoff)
- dynamic · # auto memory (model-authored MEMORY.md)
- dynamic · # MCP server instructions (per connected server)
- dynamic · git status (appended at the end via appendSystemContext)

tools
- ~16 built-in tool schemas (Read, Bash, Edit, Grep, …)
- connected MCP tool schemas (per-server, prefixed mcp__<server>__<tool>)

messages
- [0] synthetic user msg · <system-reminder> with CLAUDE.md + today's date
- [1...N] real conversation · text + tool_use + tool_result blocks

the evolving envelope

Most of the envelope doesn't change turn to turn. The system prompt is the same. The tool schemas are the same. What grows is the messages array — every turn adds at least one user/assistant pair, and tool calls add more.

Each tool call leaves two permanent blocks in the array. "What's in package.json?" produces three messages, two of which the user never sees:

// user turn
{ role: "user",
  content: "what's in package.json?" }

// assistant turn — text + tool_use
{ role: "assistant", content: [
    { type: "text",
      text: "let me read it" },
    { type: "tool_use", id: "toolu_…",
      name: "Read", input: {...} }
] }

// user turn — tool_result, linked back by tool_use_id
{ role: "user", content: [
    { type: "tool_result",
      tool_use_id: "toolu_…",
      content: "...file contents..." }
] }

Both new blocks stay in the messages array for the rest of the session. So does the assistant's text reply itself — every "let me look at this", every analysis paragraph, every summary at the end of a tool sequence. Claude's replies aren't free; they accumulate the same way tool results do. With extended thinking enabled, those reasoning blocks persist too. A typical assistant turn is a few hundred to a couple thousand tokens of text alongside any tool calls.

Parallel tool calls stack the same way — three simultaneous Greps add three tool_use blocks in one assistant message and three tool_result blocks in the next user message. Six new permanent blocks from one logical action.
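That stacking can be sketched with hypothetical message shapes mirroring the blocks above (the ids and search patterns are invented for illustration):

```python
# one logical action: three parallel Greps, six permanent tool blocks
messages = [
    {"role": "user", "content": "where is the auth middleware?"},
]

# assistant turn: one text block plus three parallel tool_use blocks
tool_uses = [
    {"type": "tool_use", "id": f"toolu_{i}", "name": "Grep", "input": {"pattern": p}}
    for i, p in enumerate(["authMiddleware", "requireAuth", "verifyToken"])
]
messages.append({
    "role": "assistant",
    "content": [{"type": "text", "text": "let me search"}] + tool_uses,
})

# the matching results come back as one user message with three tool_result blocks
messages.append({
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": t["id"], "content": "...matches..."}
        for t in tool_uses
    ],
})

# count the tool blocks added by this one action
tool_blocks = sum(
    1
    for b in messages[-2]["content"] + messages[-1]["content"]
    if b["type"] in ("tool_use", "tool_result")
)
print(tool_blocks)  # 6
```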

Even with assistant text in the mix, in tool-heavy sessions the tool results still dominate. Below is a realistic 16-turn session (debugging + documentation), with every tool call's contribution shown as its own labeled chunk and the dialogue (user + assistant text + tool_use blocks) shown as the dim grey strips. Press play, step through, or drag the scrubber.

[interactive: 16-turn session animation · turn 0/16 starts at 22,000 / 200K tokens (13% of the 167K threshold). the main bar plots system · tool schemas · msg[0] · per-turn chunks against the 167K autocompact line and the 200K window; a zoom strip breaks the current turn into user / assistant / tool_use / tool_result chunks; a toggle swaps per-chunk detail for a running cost comparison. press play, step, or drag the scrubber.]
calibration notes & sources of variance

baseline (~22K: system+skills+memory 9K, tool schemas 8K, msg[0] 5K) is calibrated against a live /context reading in a fresh Sonnet 4.6 session, no MCP servers connected. after 16 turns (12 of debugging a layout + auth bug, 4 more asking claude to survey the payment system and run tests) the envelope sits at ~113K (~68% of threshold). tool results still dominate per-turn growth — but notice turn 15, where an 8K model-authored summary is the entire contribution.

your numbers will vary. the shape of growth (baseline + slow dialog + spiky tool results) is stable; the magnitudes aren't. main sources of variance:

- CLAUDE.md size · can swing msg[0] from 0.5K to 25K+
- MCP servers · can balloon tool schemas from 8K to 50K+
- file reads & bash output · scale with project size and what gets run
- memory + skills · depend on auto-memory usage and which skills surface

what stays constant across users: the static behavioral sections (~6K), the boundary marker, and the autocompact formula.

Toggle the cumulative cost view on the animation above. If every token were billed at the full input rate, cumulative cost would grow quadratically: the envelope grows roughly linearly each turn, and you pay for the whole prefix every turn. In practice the server bills prefix tokens it has already seen at 10% of the input rate. Dotted lines show the same session on Opus 4.6 (~1.67× Sonnet rates).
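A rough sketch of that comparison, assuming a flat ~6K-per-turn growth from the ~22K baseline and an illustrative $3/MTok input rate (both assumptions for the sake of the shape, not measured values):

```python
# cumulative input cost: full-rate rebilling vs 10% cached-prefix rate
BASELINE, GROWTH = 22_000, 6_000  # assumed envelope shape, tokens
RATE = 3.0 / 1_000_000            # illustrative $ per input token

def cumulative_cost(turns: int, cached: bool) -> float:
    total, prev = 0.0, 0
    for t in range(turns):
        envelope = BASELINE + GROWTH * t
        if cached:
            # previously-seen prefix billed at 10%, only the new tail at full rate
            total += (prev * 0.1 + (envelope - prev)) * RATE
        else:
            # whole envelope rebilled at the full rate, every turn: quadratic
            total += envelope * RATE
        prev = envelope
    return total

print(round(cumulative_cost(16, cached=False), 2))  # 3.22
print(round(cumulative_cost(16, cached=True), 2))   # 0.62
```

The uncached curve sums an arithmetic series of envelopes, which is where the quadratic growth in turns comes from; caching flattens it back toward linear.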

takeaway

Two rules of thumb for using Claude Code:

Keep the context narrow. Tool results — file reads, bash output, grep matches — fill the envelope far more than prompts do, and they stay there until compaction. Pointed reads and targeted searches are cheaper per turn and buy runway before the 167K autocompact threshold trips.

Keep sessions short. The uncached cost curve is quadratic in turns, so a session that sprawls across two unrelated tasks is markedly more expensive than splitting them — even with caching, the savings gap widens as the envelope grows. Very long contexts also correlate with worse output — Chroma's Context Rot study found every one of 18 frontier models (Claude 4 included) degrades with input length, well before the window is full. A fresh session pays the ~22K baseline again, but the clean slate is usually worth it.

The next articles dig into how caching and compaction actually work.

references