Infrastructure

KV Cache

Key-Value Cache
A memory optimization that stores the key and value tensors the attention mechanism has already computed, so each new token doesn't have to recompute them. During autoregressive generation, every new token attends to all preceding tokens. Without a cache, you'd recompute attention over the entire sequence at every step. The KV cache trades memory for speed by storing what has already been computed.
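
The difference is easiest to see as two generation loops. Below is a minimal sketch, assuming a hypothetical model object with forward, prefill, and decode_step methods (these names are illustrative, not from any real library):

# Without a cache: every step re-runs attention over the whole prefix,
# so generating n tokens costs O(n^2) attention work in total.
def generate_naive(model, prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model.forward(ids)            # recomputes K/V for every token in ids
        ids.append(int(logits[-1].argmax()))
    return ids

# With a cache: the prompt's K/V are computed once ("prefill"), then each
# step feeds only the newest token and appends one K/V entry to the cache.
def generate_cached(model, prompt_ids, n_new):
    ids = list(prompt_ids)
    logits, cache = model.prefill(prompt_ids)  # K/V for the prompt, computed once
    for _ in range(n_new):
        next_id = int(logits[-1].argmax())
        ids.append(next_id)
        logits, cache = model.decode_step(next_id, cache)  # appends one K/V entry
    return ids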

Why It Matters

The KV cache is the reason LLM inference is memory-bound rather than compute-bound. A long conversation with Claude occupies more memory than just the model weights: the KV cache for a 100K-token context can eat tens of gigabytes of VRAM. This is why providers charge more for longer contexts, why the "context window" has a practical ceiling below its theoretical limit, and why techniques like paged attention and cache eviction are active research areas.

Deep Dive

In a Transformer, the attention mechanism computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). The query of the current token is compared against the keys of all previous tokens to produce attention weights, which are then used to weight the values. During generation, a fresh query is computed for each new token, but the keys and values of all previous tokens stay the same. The KV cache stores these K and V tensors so they're computed once and reused.
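
A minimal single-head sketch of this pattern in plain NumPy, with the Q/K/V projections stubbed out as random vectors since only the caching behavior matters here:

import numpy as np

def attend(q, K, V):
    # q: (head_dim,) query for the current token
    # K, V: (seq_len, head_dim) cached keys/values for all tokens so far
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over past positions
    return weights @ V                      # weighted sum of cached values

head_dim = 64
K_cache = np.empty((0, head_dim))
V_cache = np.empty((0, head_dim))

for step in range(5):
    # Each new token is projected to q, k, v (random stand-ins here).
    q, k, v = (np.random.randn(head_dim) for _ in range(3))
    # Append this token's k/v once; they are never recomputed.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)       # attends over all cached positions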

The Memory Math

KV cache size = 2 (K and V) × num_layers × num_heads × head_dim × sequence_length × bytes_per_element. For a 70B model with 80 layers, 64 heads, head dimension 128, at FP16: that's 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB per token. A 100K context therefore needs ~262 GB of KV cache alone, more than the model weights themselves. This is the fundamental constraint on long-context inference.
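
The formula translates directly into a few lines of Python; the numbers below reproduce the 70B-class example above:

def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2: both K and V are stored at every layer.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

# 80 layers, 64 heads, head_dim 128, FP16 (2 bytes per element):
print(kv_cache_bytes(80, 64, 128, seq_len=1) / 1e6)        # ~2.6 MB per token
print(kv_cache_bytes(80, 64, 128, seq_len=100_000) / 1e9)  # ~262 GB at 100K context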

Optimizations

Several techniques address KV cache pressure. Grouped Query Attention (GQA) shares key-value heads across multiple query heads, reducing cache size by 4–8x. Multi-Query Attention (MQA) goes further with a single KV head. PagedAttention (used by vLLM) manages cache memory like virtual memory pages, eliminating fragmentation. Sliding window attention limits how far back each token looks, capping cache growth. Quantizing the KV cache to FP8 or INT4 is another practical lever — some quality loss, but 2–4x memory savings.
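
To make the GQA/MQA savings concrete, here is the same size formula evaluated with fewer KV heads. The 8-head and 1-head configurations are illustrative, not the published settings of any particular model:

def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Only KV heads count toward cache size; query heads are not stored.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

ctx = 100_000
print(kv_cache_gb(80, 64, 128, ctx))                   # MHA, 64 KV heads: ~262 GB
print(kv_cache_gb(80, 8, 128, ctx))                    # GQA, 8 KV heads:  ~33 GB
print(kv_cache_gb(80, 1, 128, ctx))                    # MQA, 1 KV head:   ~4 GB
print(kv_cache_gb(80, 8, 128, ctx, bytes_per_elem=1))  # GQA + FP8 cache:  ~16 GB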
