
Prompt Caching

Also known as: Context Caching, Prefix Caching
A technique that stores and reuses already-processed prompt prefixes across multiple API calls, avoiding redundant computation. If every request you send includes the same system prompt and document context (which is very common), prompt caching processes that prefix once and subsequent requests reuse the cached computation. This reduces both latency and cost.

Why It Matters

Most AI applications send the same system prompt, few-shot examples, or reference documents with every request. Without caching, the provider processes that identical prefix every time. Prompt caching can cut input token costs by 50–90% and significantly reduce time-to-first-token. For high-traffic applications, that can amount to thousands of dollars in savings per month.

Deep Dive

The technical mechanism: during the "prefill" phase of LLM inference, the model processes all input tokens and computes their KV cache entries. Prompt caching stores this KV cache so that subsequent requests with the same prefix skip the prefill for the cached portion. Only new tokens (the user's actual message) need processing. Anthropic, OpenAI, and Google all offer some form of prompt caching.
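To make the idea concrete, here is a toy sketch of prefix reuse. Real systems cache KV tensors inside the serving stack; the names below (compute_kv, generate_from_kv) are hypothetical stand-ins for the prefill and decode steps, not a real library API.

```python
import hashlib

# Toy illustration of prefix caching. Assumed/hypothetical helper names,
# not any provider's actual implementation.
kv_cache = {}

def compute_kv(tokens):
    # Stand-in for the expensive prefill pass that builds KV entries.
    return [f"kv({t})" for t in tokens]

def generate_from_kv(kv_entries, new_tokens):
    # Stand-in for decoding, which attends over cached + new KV entries.
    return f"response built from {len(kv_entries)} KV entries"

def respond(prefix_tokens, user_tokens):
    key = hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()
    if key not in kv_cache:               # cache miss: pay full prefill cost for the prefix
        kv_cache[key] = compute_kv(prefix_tokens)
    cached = kv_cache[key]                # cache hit: prefill skipped for the prefix
    new_kv = compute_kv(user_tokens)      # only the new (user) tokens are processed
    return generate_from_kv(cached + new_kv, user_tokens)

system_prefix = ["You", "are", "a", "helpful", "assistant", "..."]
print(respond(system_prefix, ["What", "is", "prompt", "caching?"]))  # miss, prefix gets cached
print(respond(system_prefix, ["Summarize", "this", "document"]))     # hit, prefix reused
```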

How to Use It

Most implementations work by detecting matching prefixes automatically or by letting you mark cache breakpoints. The key constraint: only exact prefix matches count. If your system prompt changes by even one token, the cache misses. This means structuring your prompts with the stable parts first (system prompt, documents) and variable parts last (user message) is important for cache hit rates.
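As one concrete example, Anthropic's Messages API lets you mark a cache breakpoint on a content block with `cache_control`; OpenAI and Google instead apply caching automatically to matching prefixes, so check your provider's current docs. The model name, document text, and prompt wording below are placeholders for illustration.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for a large, stable reference document (thousands of tokens).
LARGE_REFERENCE_DOC = "...product documentation goes here..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    # Stable content first: system prompt and documents form the cacheable prefix.
    system=[
        {"type": "text", "text": "You are a support assistant for our product."},
        {
            "type": "text",
            "text": LARGE_REFERENCE_DOC,
            # Cache breakpoint: everything up to and including this block can be
            # reused by later requests whose prefix matches exactly.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Variable content last, so it never invalidates the cached prefix.
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```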

When It Matters Most

Prompt caching delivers the biggest savings when: (1) you have a long, stable prefix (large system prompts, RAG context), (2) you send many requests with that same prefix (chatbots, agents), and (3) input tokens are a significant portion of your costs. For applications with short, unique prompts, caching provides little benefit. For applications that stuff the context window with documents, it's transformative.
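A rough back-of-envelope calculation shows why the long-prefix, high-volume case is so favorable. The prices, discount, and hit rate below are illustrative assumptions, not any provider's actual rates (cached input is typically billed at a steep discount, and cache writes may carry a small premium).

```python
# Back-of-envelope savings estimate. All numbers are assumptions for illustration.
PRICE_PER_MTOK = 3.00          # $ per million uncached input tokens (assumed)
CACHED_PRICE_PER_MTOK = 0.30   # $ per million cached input tokens (assumed 90% discount)

prefix_tokens = 20_000         # stable system prompt + RAG documents
user_tokens = 200              # variable user message per request
requests = 100_000             # monthly request volume
hit_rate = 0.95                # fraction of requests that hit the cache (assumed)

def monthly_cost(cached: bool) -> float:
    prefix_mtok = prefix_tokens / 1e6
    user_mtok = user_tokens / 1e6
    if not cached:
        return requests * (prefix_mtok + user_mtok) * PRICE_PER_MTOK
    hits = requests * hit_rate
    misses = requests - hits
    cost = misses * prefix_mtok * PRICE_PER_MTOK         # misses pay full prefix price
    cost += hits * prefix_mtok * CACHED_PRICE_PER_MTOK   # hits pay the discounted rate
    cost += requests * user_mtok * PRICE_PER_MTOK        # user tokens always full price
    return cost

print(f"without caching: ${monthly_cost(False):,.0f}/month")  # ~ $6,060 under these assumptions
print(f"with caching:    ${monthly_cost(True):,.0f}/month")   # ~   $930 under these assumptions
```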

Related Concepts

Prompt · Prompt Engineering