
Prompt Caching

Also known as: Context Caching, Prefix Caching
A technique for saving and reusing an already-processed prompt prefix across multiple API calls, avoiding redundant computation. If you send the same system prompt and document context with every request (which is common), prompt caching processes that prefix once, and subsequent requests reuse the cached computation. This reduces both latency and cost.

Why It Matters

Most AI applications send the same system prompt, few-shot examples, or reference documents with every request. Without caching, the provider reprocesses that identical prefix every single time. Prompt caching can cut input token costs by 50–90% and significantly reduce time-to-first-token. For high-traffic applications, that can save thousands of dollars per month.

Deep Dive

The technical mechanism: during the "prefill" phase of LLM inference, the model processes all input tokens and computes their KV cache entries. Prompt caching stores this KV cache so that subsequent requests with the same prefix skip the prefill for the cached portion. Only new tokens (the user's actual message) need processing. Anthropic, OpenAI, and Google all offer some form of prompt caching.
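
To make the mechanism concrete, here is a toy Python sketch of prefix-keyed cache reuse. It is purely illustrative: the `prefill` function and the cache's contents are hypothetical stand-ins for the real KV-cache computation inside an inference server.

```python
import hashlib

kv_cache = {}  # prefix hash -> precomputed state (toy stand-in for real KV entries)

def prefill(tokens):
    # Hypothetical stand-in for the expensive attention prefill over `tokens`.
    return {"tokens_processed": len(tokens)}

def run_request(prefix_tokens, user_tokens):
    key = hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()
    if key not in kv_cache:
        # Cache miss: pay the full prefill cost for the prefix once.
        kv_cache[key] = prefill(prefix_tokens)
    # On later calls with the same prefix, its state is reused and only
    # the new user tokens still need processing.
    prefix_state = kv_cache[key]
    user_state = prefill(user_tokens)
    return prefix_state, user_state

run_request(["system", "prompt", "and", "documents"], ["first", "question"])
run_request(["system", "prompt", "and", "documents"], ["second", "question"])  # prefix reused
```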

How to Use It

Most implementations work by detecting matching prefixes automatically or by letting you mark cache breakpoints. The key constraint: only exact prefix matches count. If your system prompt changes by even one token, the cache misses. This means structuring your prompts with the stable parts first (system prompt, documents) and variable parts last (user message) is important for cache hit rates.
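
As a concrete example of marking a cache breakpoint, Anthropic's Messages API accepts a `cache_control` field on a content block; everything up to that block becomes eligible for caching. A minimal sketch (the model name and prompt contents are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the large, stable prefix you want cached

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever model you target
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # The variable part (the user's message) comes after the cached prefix.
    messages=[{"role": "user", "content": "What does clause 4.2 mean?"}],
)
print(response.content[0].text)
```

Note how the structure follows the exact-prefix rule: the stable system prompt sits first and carries the breakpoint, while the changing user message comes last, so every request with the same system prompt hits the cache.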

When It Matters Most

Prompt caching delivers the biggest savings when: (1) you have a long, stable prefix (large system prompts, RAG context), (2) you send many requests with that same prefix (chatbots, agents), and (3) input tokens are a significant portion of your costs. For applications with short, unique prompts, caching provides little benefit. For applications that stuff the context window with documents, it's transformative.
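
A back-of-envelope estimate shows why condition (1) and (2) matter. The numbers below are assumptions: a $3-per-million-token base input price, Anthropic's published multipliers (cache writes ~1.25x, cache reads ~0.1x base), and a prefix that stays warm in the cache between requests; verify all of these against current pricing and TTLs.

```python
# Assumed workload: 10,000 requests/day with a 20,000-token stable prefix,
# at an assumed $3 per million input tokens.
requests, prefix_tokens, base = 10_000, 20_000, 3.00 / 1_000_000

no_cache = requests * prefix_tokens * base
# Assume the first request writes the cache (~1.25x base) and the rest read it (~0.1x).
with_cache = prefix_tokens * base * (1.25 + 0.1 * (requests - 1))

print(f"without caching: ${no_cache:,.2f}/day")   # $600.00/day
print(f"with caching:    ${with_cache:,.2f}/day") # ~$60.07/day, roughly 90% saved
```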
