Prompt Caching

Also known as: Context Caching, Prefix Caching
A technique that saves and reuses the processed version of a prompt prefix across multiple API calls, avoiding redundant computation. If you send the same system prompt and document context with every request (which is common), prompt caching processes it once and reuses the cached computation for subsequent requests. This reduces both latency and cost.

Why it matters

Most AI applications send the same system prompt, few-shot examples, or reference documents with every request. Without caching, the provider reprocesses this identical prefix on every call. Depending on the provider and model, prompt caching can cut the cost of the cached input tokens by roughly 50–90% and noticeably reduce time-to-first-token. For high-volume applications with long prefixes, this can translate to thousands of dollars saved per month.

Deep Dive

The technical mechanism: during the "prefill" phase of LLM inference, the model processes all input tokens and computes key/value (KV) cache entries for them. Prompt caching stores this KV cache so that subsequent requests with the same prefix skip the prefill for the cached portion. Only the tokens after the cached prefix (typically the user's actual message) need processing. Anthropic, OpenAI, and Google all offer some form of prompt caching.
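The effect of reusing prefill work can be illustrated with a toy model. The sketch below is not how any provider implements caching: the class name ToyPrefixCache, the whitespace "tokenizer", and the fake per-token KV entries are all assumptions made for illustration. It only counts how many tokens would need prefill with and without a cached prefix.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Toy stand-in for a provider-side KV cache, keyed by an exact token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, ...], List[str]] = {}
        self.tokens_processed = 0  # running count of simulated prefill work

    def _prefill(self, tokens: List[str]) -> List[str]:
        # Stand-in for the expensive computation. A real KV entry depends on
        # all preceding tokens; here it is just a labelled placeholder.
        self.tokens_processed += len(tokens)
        return [f"kv({t})" for t in tokens]

    def run(self, prefix: List[str], suffix: List[str]) -> List[str]:
        key = tuple(prefix)
        if key not in self._store:
            # Cache miss: pay for the prefix once and remember the result.
            self._store[key] = self._prefill(prefix)
        # Cache hit or miss, the new suffix tokens always need prefill.
        return self._store[key] + self._prefill(suffix)


cache = ToyPrefixCache()
system = "You are a helpful assistant .".split()  # hypothetical system prompt

cache.run(system, "What is caching ?".split())
first = cache.tokens_processed            # prefix + first user message

cache.run(system, "Give an example .".split())
second = cache.tokens_processed - first   # only the second user message

print(first, second)
```

The second call does prefill work only for its own user message, because the system prompt's entries were stored by the first call.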

How to Use It

Most implementations work either by detecting matching prefixes automatically or by letting you mark cache breakpoints explicitly. The key constraint: only exact prefix matches count. A change of even one token invalidates the cache from that token onward, so an edit near the start of the prompt loses almost all of the benefit. To keep cache hit rates high, structure your prompts with the stable parts first (system prompt, documents) and the variable parts last (user message).
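To see why ordering matters, it helps to measure how many leading tokens two requests actually share. The following is a minimal sketch under stated assumptions: tokens are approximated by whitespace splitting, and shared_prefix_len, build_prompt, SYSTEM, and DOCUMENT are hypothetical names, not part of any provider's API.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system: str, document: str, user: str,
                 stable_first: bool = True) -> List[str]:
    """Assemble a prompt and split it into (approximate) tokens."""
    parts = [system, document, user] if stable_first else [user, system, document]
    return " ".join(parts).split()


SYSTEM = "You answer questions about the attached manual ."
DOCUMENT = "Manual : press the red button to reset the device ."

# Stable parts first: the two requests differ only at the end.
good_a = build_prompt(SYSTEM, DOCUMENT, "How do I reset it ?")
good_b = build_prompt(SYSTEM, DOCUMENT, "What color is the button ?")

# Variable part first: the two requests differ from the very first token.
bad_a = build_prompt(SYSTEM, DOCUMENT, "How do I reset it ?", stable_first=False)
bad_b = build_prompt(SYSTEM, DOCUMENT, "What color is the button ?", stable_first=False)

print("stable-first shared tokens:", shared_prefix_len(good_a, good_b))
print("variable-first shared tokens:", shared_prefix_len(bad_a, bad_b))
```

With the stable parts first, the whole system prompt and document form a reusable prefix. With the user message first, the two requests share no prefix at all, even though most of their content is identical.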

When It Matters Most

Prompt caching delivers the biggest savings when: (1) you have a long, stable prefix (large system prompts, RAG context), (2) you send many requests with that same prefix (chatbots, agents), and (3) input tokens are a significant portion of your costs. For applications with short, unique prompts, caching provides little benefit. For applications that fill the context window with the same documents on every request, it can remove most of the input cost.
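These three conditions can be turned into a rough back-of-the-envelope estimate. The sketch below is an illustration, not a pricing model for any real provider: the function input_cost, the price of 1.00 per million tokens, the cached-token price factor of 0.1, and the 90% hit rate are all hypothetical values, and it ignores details such as cache-write surcharges, minimum prefix lengths, and cache expiry.

```python
def input_cost(prefix_tokens: int, suffix_tokens: int, requests: int,
               price_per_mtok: float, cached_price_factor: float = 1.0,
               hit_rate: float = 0.0) -> float:
    """Estimated input-token cost for a batch of requests.

    cached_price_factor: price of a cached token relative to a normal one
                         (hypothetical; 0.1 means cached tokens cost 10%).
    hit_rate:            fraction of requests whose prefix is served from cache.
    """
    per_token = price_per_mtok / 1_000_000
    prefix_full = prefix_tokens * per_token
    prefix_cached = prefix_full * cached_price_factor
    # Each request pays the cached price on a hit and the full price on a miss.
    prefix_cost = requests * (hit_rate * prefix_cached + (1 - hit_rate) * prefix_full)
    # The variable suffix is never cached in this simplified model.
    suffix_cost = requests * suffix_tokens * per_token
    return prefix_cost + suffix_cost


# Hypothetical workload: 20,000-token stable prefix, 200-token user message.
baseline = input_cost(20_000, 200, 100_000, price_per_mtok=1.00)
with_cache = input_cost(20_000, 200, 100_000, price_per_mtok=1.00,
                        cached_price_factor=0.1, hit_rate=0.9)
saving = 1 - with_cache / baseline

print(f"baseline: {baseline:.2f}, with cache: {with_cache:.2f}, saving: {saving:.0%}")
```

Shrinking prefix_tokens or hit_rate in this sketch shows the savings falling off quickly, which matches the conditions listed above.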
