
KV Cache

Key-Value Cache
A memory optimization that stores the previously computed key and value tensors from the attention mechanism so they don't need to be recomputed for each new token. During autoregressive generation, each new token attends to all previous tokens. Without caching, you'd recompute attention for the entire sequence at every step. The KV cache trades memory for speed by storing what's already been computed.
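The speed-for-memory trade can be sketched with a toy count of how many K/V projections get computed while generating T tokens. This is a simplification (it counts only projections, not the attention matmuls themselves), but it shows the asymptotic difference:

```python
# Toy cost model: number of K/V projections computed while
# generating T tokens, with and without a KV cache.

def kv_projections_without_cache(T: int) -> int:
    # Every step re-projects K and V for the entire prefix so far:
    # quadratic total work in T.
    return sum(2 * t for t in range(1, T + 1))  # 2 = one K + one V per token

def kv_projections_with_cache(T: int) -> int:
    # Each token's K and V are computed once, stored, and reused:
    # linear total work in T.
    return 2 * T

print(kv_projections_without_cache(1000))  # 1001000 — quadratic growth
print(kv_projections_with_cache(1000))     # 2000 — linear growth
```

The quadratic term is exactly the redundant work the cache eliminates.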

Why it matters

The KV cache is why LLM inference is memory-bound, not compute-bound. A long conversation with Claude doesn't just use memory for the model weights — the KV cache for a 100K token context can consume tens of gigabytes of VRAM. It's the reason providers charge more for longer contexts, why "context window" has a practical ceiling beyond the theoretical limit, and why techniques like paged attention and cache eviction are active research areas.

Deep Dive

In a Transformer, the attention mechanism computes three matrices for each token: Query (Q), Key (K), and Value (V). The query of the current token is compared against the keys of all previous tokens to produce attention weights, which are then used to weight the values. During generation, the Q changes with each new token, but the K and V for all previous tokens stay the same. The KV cache stores these K and V matrices so they're computed once and reused.
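A minimal NumPy sketch of this loop, with the simplifying assumptions that there is a single head and no learned Q/K/V projection matrices (the hidden state stands in for all three). The essential pattern survives the simplification: each step appends one row to the K and V caches and reuses all earlier rows:

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """One decoding step: the new token's query against all cached keys/values."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])  # (seq_len,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the prefix
    return weights @ v_cache                     # weighted sum of cached values

d = 8
k_cache = np.empty((0, d))  # caches start empty
v_cache = np.empty((0, d))
for step in range(3):
    x = np.random.randn(d)  # stand-in for the new token's hidden state
    q, k, v = x, x, x       # real models apply learned projections here
    k_cache = np.vstack([k_cache, k])  # K and V: computed once, appended, reused
    v_cache = np.vstack([v_cache, v])
    out = attend(q, k_cache, v_cache)  # only Q is fresh each step

assert out.shape == (d,)
```

Note that only `q` is recomputed per step; the cached rows of K and V are never touched again, which is precisely why they can be stored.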

The Memory Math

KV cache size = 2 (K and V) × num_layers × num_heads × head_dim × sequence_length × bytes_per_element. For a 70B model with 80 layers, 64 heads, head dimension 128, at FP16: that's 2 × 80 × 64 × 128 × 2 bytes ≈ 2.6 MB per token. A 100K context therefore needs ~260 GB of KV cache alone — more than the model weights themselves (~140 GB at FP16). This is the fundamental constraint on long-context inference.
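The arithmetic above can be checked directly. The per-token byte count is exact for the stated configuration; the helper below just encodes the formula from the text:

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 accounts for storing both K and V.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

# 70B-style config from the text: 80 layers, 64 heads, head_dim 128, FP16.
per_token = kv_cache_bytes(80, 64, 128, seq_len=1)
print(per_token)                                       # 2621440 bytes ≈ 2.6 MB
print(kv_cache_bytes(80, 64, 128, 100_000) / 1e9)      # ≈ 262 GB for 100K tokens
```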

Optimizations

Several techniques address KV cache pressure. Grouped Query Attention (GQA) shares key-value heads across multiple query heads, reducing cache size by 4–8x. Multi-Query Attention (MQA) goes further with a single KV head. PagedAttention (used by vLLM) manages cache memory like virtual memory pages, eliminating fragmentation. Sliding window attention limits how far back each token looks, capping cache growth. Quantizing the KV cache to FP8 or INT4 is another practical lever — some quality loss, but 2–4x memory savings.
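The GQA and MQA savings fall out of the same size formula with `num_heads` replaced by the number of KV heads. The 8-KV-head split below is a hypothetical example for a 64-query-head model, not any specific model's configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Cache size depends on KV heads, not query heads.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config with 64 query heads, 100K context, FP16.
mha = kv_cache_bytes(80, 64, 128, 100_000)  # standard MHA: one KV head per query head
gqa = kv_cache_bytes(80, 8, 128, 100_000)   # GQA: 8 KV heads shared by 64 query heads
mqa = kv_cache_bytes(80, 1, 128, 100_000)   # MQA: a single KV head

print(mha // gqa)  # 8  — 8x smaller cache with GQA
print(mha // mqa)  # 64 — 64x smaller with MQA
```

Quantizing the cached elements composes with this: dropping `bytes_per_elem` from 2 (FP16) to 1 (FP8) halves any of the figures above again.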
