AI Coding Guides Deep Dives
Provider-side prompt caching • Cached-token discounts • Request shaping

Prompt Caching That Actually Reaches the API

Plenty of repos expose cached-token stats. Far fewer actually emit the right payload shape — cache_control markers, cached-content resources, provider-specific headers, stable-prefix rules, and breakpoint budgeting — that turns provider marketing into a real bill reduction.

(Alright, ad over. Back to the serious technical analysis.)

This is provider-side caching, not prompt injection defense

The subject here is the billing and latency layer: providers discount repeated prompt prefixes when the request keeps a sufficiently stable shape. That makes prompt caching a concrete implementation problem, not a vague product checkbox. Anthropic exposes explicit cache_control markers, Gemini exposes cached-content resources, DashScope uses vendor headers, and OpenAI-style APIs often surface cached-token counts without requiring the harness to shape the request explicitly.

ℹ️

The hard split

A repo can know about cached tokens without controlling them. Codex and Zaica are good examples of telemetry-first support. By contrast, Claude Code, Dirac, DeerFlow, ADK-Rust, and OpenCode write payload-shaping logic that can materially change whether the provider reuses the prefix.

[stable system prefix] ---- cached
[stable tool schema] ----- cached
[recent stable turns] ---- cached
[dynamic reminders/date] - uncached
[fresh user turn] -------- uncached

That split is why prompt caching belongs beside model and provider wiring, not beside permissions or prompt authoring. The interesting engineering is in which blocks stay stable, which blocks move to the dynamic suffix, and how the harness avoids silently busting the cache every turn.

Four support tiers

Direct request shaping

The harness emits explicit markers or cached-content references. This is where real implementation lives: breakpoint limits, TTL gating, provider transforms, and stable-prefix design.

Provider transform layer

Multi-provider repos tend to centralize caching in adapters or serializers. Dirac, Qwen Code, and OpenCode fit here.

Telemetry and parser support

The harness reads cached_tokens, cache_read_input_tokens, or similar usage fields, but does not shape the request. Useful for analytics, not enough to earn the discount by itself.

Cache-aware but not cache-directing

DeepSeek Reasonix is the best example. It tracks prefix-cache effectiveness and cost diagnostics, but the caching is still mostly provider magic rather than harness-authored markers.

Comparison matrix: who really implements it?

Repo Tier Validated mechanism What matters
Claude Code source snapshot Direct shaping Anthropic cache_control, 1h TTL gating, cache-break detection, scope/beta handling Best example of session-stable cache policy
Dirac Direct shaping Provider-specific caching branches for Anthropic, LiteLLM, Vertex, Gemini, Bedrock, and more Best multi-provider exemplar
DeerFlow Direct shaping Four-breakpoint budgeting, last-eligible placement, OAuth strip/disable path Excellent “hard API limit” example
Qwen Code Transform layer Anthropic cache markers plus DashScope cache headers and cached-token telemetry Strong bridge between request shaping and observability
ADK-Rust Transform layer Anthropic cache-control helpers plus Gemini cached-content lifecycle and pricing Best reusable framework example
OpenCode Transform layer Maps one abstraction to Anthropic, OpenRouter, Bedrock, OpenAI-compatible, Copilot, and Alibaba dialects Shows how abstractions survive provider divergence
DeepSeek Reasonix Cache-aware Prefix-cache hit diagnostics and cost visibility, but not the same request-side control surface Important correction to simplistic “supported/unsupported” buckets
Codex Telemetry only Cached-token metrics and status display Reads what happened after the fact
Zaica Telemetry only SSE parsing of Anthropic/OpenAI-style cache token fields Normalizes provider output rather than steering it
OpenHands / Kimi CLI Partial plumbing Cache-adjacent flags, types, or provider code paths, but weaker evidence of flagship request shaping Worth mentioning carefully, not headlining
Pochi / Wintermolt Absent No validated provider-side prompt-caching implementation surfaced in the current repo snapshot Useful negative controls

Anthropic turned caching into payload architecture

The Anthropic-style implementations are the clearest proof that prompt caching is not generic. The harness has to decide which blocks deserve cache_control, how many markers to spend, whether the TTL should remain stable for the whole session, and which dynamic content must be pushed out of the cached prefix.

Claude Code snapshot: session-stable policy

The Claude Code snapshot does the most sophisticated version of this. It latches 1h-TTL eligibility and allowlists in bootstrap state so mid-session config flips do not suddenly change the marker shape and bust the server-side cache. It also tracks prompt cache break detection as a first-class runtime concern.

DeerFlow: breakpoint budgeting

DeerFlow is the cleanest example of hard limit management. It caps itself at four breakpoints, places them on the last eligible blocks, and disables or strips caching on OAuth-shaped requests when that path would otherwise violate the provider constraints.

ADK-Rust: reusable cache helpers

ADK-Rust is less of an opinionated product shell and more of a reusable library. That makes its cache-control utilities valuable: they show what a framework needs to expose if downstream agent builders want to manage breakpoints instead of hardcoding them.

OpenCode and Dirac: abstraction pressure

Both repos demonstrate the real multi-provider problem: you can call the concept “prompt caching”, but the payload fields diverge. Anthropic wants cache_control, Bedrock speaks in cache points, OpenAI-compatible backends vary, and one transform layer has to normalize all of it.

⚠️

The real failure mode is churn

Timestamps, rotating tool lists, mutable reminders, session flags, and mid-conversation TTL changes all destroy the stable prefix. The best repos do not just “enable caching”; they freeze the policy inputs that would otherwise invalidate it.

There is no universal cache dialect

Once you leave Anthropic, the story gets more fragmented. That is not a bug in the repos; it is a property of the provider ecosystem.

Provider family Typical mechanism Strong repo example Why it matters
Anthropic / Claude API cache_control markers on prompt blocks Claude Code snapshot, DeerFlow, ADK-Rust, Dirac Explicit request shaping, explicit breakpoint budget
Gemini Cached-content resources and cached-token accounting ADK-Rust Separate lifecycle object rather than just inline markers
DashScope / Qwen Provider headers plus request rewriting Qwen Code Vendor-specific enablement path, not “just OpenAI compatible”
Bedrock / Vertex / Lite gateways Provider-specific transforms and capability flags Dirac, OpenCode Best example of why adapters cannot stay shallow
OpenAI-style APIs Often response-side cached-token reporting Codex, Zaica Observability may exist even when request-side control does not

The broader implication is the same one visible across provider wiring: the agents with the most realistic model support invest in serialization code, not just config files. Prompt caching makes that visible faster than almost any other feature because the APIs disagree on both how to enable it and how to report it back.

Telemetry is useful, but it is not the implementation

Codex and Zaica

These repos show the healthy version of telemetry-only support. They parse cached-token counts, surface them in status or usage structures, and give the operator visibility into whether the provider reused anything. That matters. It just is not the same as designing the prefix.

OpenHands and Kimi CLI

The current snapshot shows cache-adjacent plumbing rather than a flagship direct implementation. They belong in the discussion, but more as “interesting edges” than as the page's headline examples.

🧪

Why telemetry still matters

Telemetry-only repos are the easiest place to notice a mismatch between provider promises and real workload shape. That is exactly why Reasonix CheetahClaws is interesting: it treats prefix-cache efficiency as a diagnostic signal even though it does not own the same marker surface as the Anthropic-heavy clients.

What the strongest repos are really doing

Claude Code snapshot

Best-in-class session stability: TTL latching, cache-break detection, and careful separation between static prefix and dynamic suffix.

Dirac

The most convincing “provider matrix” story. Cache behavior is treated as a branch in the adapter layer, not a universal switch.

DeerFlow

Excellent implementation discipline: strict breakpoint budgeting, last-candidate placement, and tested edge cases around OAuth and payload shape.

Qwen Code

Good hybrid of transport-level enablement and telemetry. It proves the feature survives beyond Anthropic if the repo is willing to branch by provider.

ADK-Rust

Best reusable API surface for framework builders who want both Anthropic markers and Gemini cached-content lifecycle management.

OpenCode

Strong abstraction example: one transform layer, several cache dialects, and no pretense that the providers are actually uniform.

Cache-busting failure modes

Recommendations for new agent builders

  1. Model the stable prefix explicitly. Treat the system prompt, tool schema, and a few recent blocks as separate cacheable regions rather than dumping everything into one ever-changing body.
  2. Keep cache policy session-stable. If a feature flag, rollout bucket, or eligibility check can flip mid-session, latch it or you will destroy the prefix you just paid to establish.
  3. Budget breakpoints as a scarce resource. DeerFlow gets this right: explicit caps beat magical “cache everything” optimism.
  4. Use provider transforms, not fake universality. Anthropic, Gemini, DashScope, Bedrock, and OpenAI-compatible backends do not speak one cache language.
  5. Separate request shaping from telemetry. Both are worth building, but they solve different problems and should not be conflated in architecture docs or UI.
  6. Keep dynamic reminders out of the cached prefix. The same design decision that helps prompt caching also makes context compaction easier, because the harness has a cleaner line between durable context and turn-local noise.

Bottom line

Prompt caching is one of the clearest separators between “supports many providers” as a marketing claim and “understands provider behavior” as an engineering reality. The best repos do three things at once:

  1. shape the request deliberately,
  2. keep the cache policy stable across the session, and
  3. measure the outcome so the operator can see whether it worked.

If you only have time to steal one pattern, steal the combination of stable-prefix design plus provider-specific transforms. Everything else — cached-token dashboards, cost displays, usage tables — gets more honest once that foundation exists.