Prompt Caching That Actually Reaches the API
Plenty of repos expose cached-token stats. Far fewer actually emit the
right payload shape — cache_control markers, cached-content
resources, provider-specific headers, stable-prefix rules, and
breakpoint budgeting — that turns provider marketing into a real bill
reduction.
This is provider-side caching, not prompt injection defense
The subject here is the billing and latency layer: providers discount
repeated prompt prefixes when the request keeps a sufficiently stable
shape. That makes prompt caching a concrete implementation problem,
not a vague product checkbox. Anthropic exposes
explicit cache_control markers, Gemini
exposes cached-content resources, DashScope uses
vendor headers, and OpenAI-style APIs often surface
cached-token counts without requiring the harness to shape the request
explicitly.
[stable system prefix] ---- cached
[stable tool schema] ----- cached
[recent stable turns] ---- cached
[dynamic reminders/date] - uncached
[fresh user turn] -------- uncached
That split is why prompt caching belongs beside model and provider wiring, not beside permissions or prompt authoring. The interesting engineering is in which blocks stay stable, which blocks move to the dynamic suffix, and how the harness avoids silently busting the cache every turn.
Four support tiers
Direct request shaping
The harness emits explicit markers or cached-content references. This is where real implementation lives: breakpoint limits, TTL gating, provider transforms, and stable-prefix design.
Provider transform layer
Multi-provider repos tend to centralize caching in adapters or serializers. Dirac, Qwen Code, and OpenCode fit here.
Telemetry and parser support
The harness reads cached_tokens,
cache_read_input_tokens, or similar usage fields, but
does not shape the request. Useful for analytics, not enough to
earn the discount by itself.
Cache-aware but not cache-directing
DeepSeek Reasonix is the best example. It tracks prefix-cache effectiveness and cost diagnostics, but the caching is still mostly provider magic rather than harness-authored markers.
Comparison matrix: who really implements it?
| Repo | Tier | Validated mechanism | What matters |
|---|---|---|---|
| Claude Code source snapshot | Direct shaping | Anthropic cache_control, 1h TTL gating, cache-break detection, scope/beta handling |
Best example of session-stable cache policy |
| Dirac | Direct shaping | Provider-specific caching branches for Anthropic, LiteLLM, Vertex, Gemini, Bedrock, and more | Best multi-provider exemplar |
| DeerFlow | Direct shaping | Four-breakpoint budgeting, last-eligible placement, OAuth strip/disable path | Excellent “hard API limit” example |
| Qwen Code | Transform layer | Anthropic cache markers plus DashScope cache headers and cached-token telemetry | Strong bridge between request shaping and observability |
| ADK-Rust | Transform layer | Anthropic cache-control helpers plus Gemini cached-content lifecycle and pricing | Best reusable framework example |
| OpenCode | Transform layer | Maps one abstraction to Anthropic, OpenRouter, Bedrock, OpenAI-compatible, Copilot, and Alibaba dialects | Shows how abstractions survive provider divergence |
| DeepSeek Reasonix | Cache-aware | Prefix-cache hit diagnostics and cost visibility, but not the same request-side control surface | Important correction to simplistic “supported/unsupported” buckets |
| Codex | Telemetry only | Cached-token metrics and status display | Reads what happened after the fact |
| Zaica | Telemetry only | SSE parsing of Anthropic/OpenAI-style cache token fields | Normalizes provider output rather than steering it |
| OpenHands / Kimi CLI | Partial plumbing | Cache-adjacent flags, types, or provider code paths, but weaker evidence of flagship request shaping | Worth mentioning carefully, not headlining |
| Pochi / Wintermolt | Absent | No validated provider-side prompt-caching implementation surfaced in the current repo snapshot | Useful negative controls |
Anthropic turned caching into payload architecture
The Anthropic-style implementations are the clearest proof that prompt
caching is not generic. The harness has to decide which blocks deserve
cache_control, how many markers to spend, whether the TTL
should remain stable for the whole session, and which dynamic content
must be pushed out of the cached prefix.
Claude Code snapshot: session-stable policy
The Claude Code snapshot does the most sophisticated version of this. It latches 1h-TTL eligibility and allowlists in bootstrap state so mid-session config flips do not suddenly change the marker shape and bust the server-side cache. It also tracks prompt cache break detection as a first-class runtime concern.
DeerFlow: breakpoint budgeting
DeerFlow is the cleanest example of hard limit management. It caps itself at four breakpoints, places them on the last eligible blocks, and disables or strips caching on OAuth-shaped requests when that path would otherwise violate the provider constraints.
ADK-Rust: reusable cache helpers
ADK-Rust is less of an opinionated product shell and more of a reusable library. That makes its cache-control utilities valuable: they show what a framework needs to expose if downstream agent builders want to manage breakpoints instead of hardcoding them.
OpenCode and Dirac: abstraction pressure
Both repos demonstrate the real multi-provider problem: you can
call the concept “prompt caching”, but the payload fields diverge.
Anthropic wants cache_control, Bedrock speaks in cache
points, OpenAI-compatible backends vary, and one transform layer
has to normalize all of it.
The real failure mode is churn
Timestamps, rotating tool lists, mutable reminders, session flags, and mid-conversation TTL changes all destroy the stable prefix. The best repos do not just “enable caching”; they freeze the policy inputs that would otherwise invalidate it.
There is no universal cache dialect
Once you leave Anthropic, the story gets more fragmented. That is not a bug in the repos; it is a property of the provider ecosystem.
| Provider family | Typical mechanism | Strong repo example | Why it matters |
|---|---|---|---|
| Anthropic / Claude API | cache_control markers on prompt blocks |
Claude Code snapshot, DeerFlow, ADK-Rust, Dirac | Explicit request shaping, explicit breakpoint budget |
| Gemini | Cached-content resources and cached-token accounting | ADK-Rust | Separate lifecycle object rather than just inline markers |
| DashScope / Qwen | Provider headers plus request rewriting | Qwen Code | Vendor-specific enablement path, not “just OpenAI compatible” |
| Bedrock / Vertex / Lite gateways | Provider-specific transforms and capability flags | Dirac, OpenCode | Best example of why adapters cannot stay shallow |
| OpenAI-style APIs | Often response-side cached-token reporting | Codex, Zaica | Observability may exist even when request-side control does not |
The broader implication is the same one visible across provider wiring: the agents with the most realistic model support invest in serialization code, not just config files. Prompt caching makes that visible faster than almost any other feature because the APIs disagree on both how to enable it and how to report it back.
Telemetry is useful, but it is not the implementation
Codex and Zaica
These repos show the healthy version of telemetry-only support. They parse cached-token counts, surface them in status or usage structures, and give the operator visibility into whether the provider reused anything. That matters. It just is not the same as designing the prefix.
OpenHands and Kimi CLI
The current snapshot shows cache-adjacent plumbing rather than a flagship direct implementation. They belong in the discussion, but more as “interesting edges” than as the page's headline examples.
Why telemetry still matters
Telemetry-only repos are the easiest place to notice a mismatch between provider promises and real workload shape. That is exactly why Reasonix CheetahClaws is interesting: it treats prefix-cache efficiency as a diagnostic signal even though it does not own the same marker surface as the Anthropic-heavy clients.
What the strongest repos are really doing
Claude Code snapshot
Best-in-class session stability: TTL latching, cache-break detection, and careful separation between static prefix and dynamic suffix.
Dirac
The most convincing “provider matrix” story. Cache behavior is treated as a branch in the adapter layer, not a universal switch.
DeerFlow
Excellent implementation discipline: strict breakpoint budgeting, last-candidate placement, and tested edge cases around OAuth and payload shape.
Qwen Code
Good hybrid of transport-level enablement and telemetry. It proves the feature survives beyond Anthropic if the repo is willing to branch by provider.
ADK-Rust
Best reusable API surface for framework builders who want both Anthropic markers and Gemini cached-content lifecycle management.
OpenCode
Strong abstraction example: one transform layer, several cache dialects, and no pretense that the providers are actually uniform.
Cache-busting failure modes
- Mutable system prompts: injecting the date, session hints, or user memory into the cached prefix guarantees churn.
- Tool-list instability: changing tool definitions mid-session can invalidate a large prefix unless the harness isolates that portion.
- TTL flips: changing from 5m to 1h during a session is itself a payload shape change. Claude Code's session-stable TTL latching exists specifically to avoid that.
- Auth path changes: DeerFlow's OAuth logic shows that billing headers and cache markers can conflict unless the repo handles that path explicitly.
- Confusing analytics with control: cached-token stats help you diagnose waste, but they do not make the provider cache a prefix you never shaped correctly.
Recommendations for new agent builders
- Model the stable prefix explicitly. Treat the system prompt, tool schema, and a few recent blocks as separate cacheable regions rather than dumping everything into one ever-changing body.
- Keep cache policy session-stable. If a feature flag, rollout bucket, or eligibility check can flip mid-session, latch it or you will destroy the prefix you just paid to establish.
- Budget breakpoints as a scarce resource. DeerFlow gets this right: explicit caps beat magical “cache everything” optimism.
- Use provider transforms, not fake universality. Anthropic, Gemini, DashScope, Bedrock, and OpenAI-compatible backends do not speak one cache language.
- Separate request shaping from telemetry. Both are worth building, but they solve different problems and should not be conflated in architecture docs or UI.
- Keep dynamic reminders out of the cached prefix. The same design decision that helps prompt caching also makes context compaction easier, because the harness has a cleaner line between durable context and turn-local noise.
Bottom line
Prompt caching is one of the clearest separators between “supports many providers” as a marketing claim and “understands provider behavior” as an engineering reality. The best repos do three things at once:
- shape the request deliberately,
- keep the cache policy stable across the session, and
- measure the outcome so the operator can see whether it worked.
If you only have time to steal one pattern, steal the combination of stable-prefix design plus provider-specific transforms. Everything else — cached-token dashboards, cost displays, usage tables — gets more honest once that foundation exists.