KV cache memory has overtaken model weights as the load-bearing constraint for production LLM inference at scale. The numbers from a Marktechpost technical survey published Wednesday: a 30-billion-parameter model running batch 128 with 1024-token inputs needs roughly 180GB just for KV cache state. For a 7B model, KV cache (72GB) is 5x bigger than the model parameters themselves (14GB at FP16). That inversion drives an active research area: compress the KV cache without retraining the base model and you reclaim batch-size headroom, increase throughput, and serve more concurrent users on the same hardware. The survey lays out 10 production-relevant techniques across four strategy families.
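Those numbers fall out of the cache's shape: every layer stores one key and one value vector per token, per KV head, for the whole batch. A back-of-the-envelope sketch in Python, using an illustrative 7B-class configuration (32 layers, 32 KV heads, head dim 128, FP16 cache) rather than the survey's exact setup:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2 = one key vector + one value vector per token, per KV head, per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class config at the survey's batch/sequence setting.
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=1024, batch=128) / 1e9
print(f"KV cache: {gb:.0f} GB vs ~14 GB of FP16 weights")  # ~69 GB
```

At these settings the formula gives roughly 69GB, the same ballpark as the survey's 72GB figure; the exact number depends on the model's real head layout and on how many generated tokens you count.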
Family one is eviction: keep some tokens, drop others. H2O (Heavy Hitter Oracle, NeurIPS 2023) builds on the observation that a small fraction of tokens carries most of the attention mass; it dynamically retains those heavy hitters plus recent tokens, achieving up to 29x throughput over HuggingFace Accelerate on OPT-6.7B/30B. StreamingLLM keeps the first few tokens (which act as "attention sinks") plus a sliding recency window; it is fast and hardware-friendly but blind to semantic importance in the middle of the context. SnapKV uses an observation window at the end of long prompts to compress the prefill stage specifically, attacking a phase H2O leaves untouched. PyramidKV and PyramidInfer allocate different cache sizes per layer based on observed attention patterns, claiming 2.2x throughput and 54% GPU memory reduction. The eviction family's failure mode is information loss: anything thrown away is gone for the rest of generation, so quality degrades on tasks that need scattered mid-context recall.
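To make the eviction idea concrete, here is a minimal numpy sketch in the spirit of H2O's heavy-hitter policy (the function and variable names are my own, not the paper's implementation): keep a recency window unconditionally, then spend the rest of a fixed token budget on the cached tokens that have accumulated the most attention mass so far.

```python
import numpy as np

def evict_kv(keys, values, attn_mass, budget, recent=32):
    """keys/values: (seq_len, n_heads, head_dim) cache for one layer.
    attn_mass: (seq_len,) attention each cached token has received so far.
    Keeps the `recent` newest tokens plus the highest-scoring older tokens."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, np.arange(seq_len)
    recent_idx = np.arange(seq_len - recent, seq_len)                     # always keep the tail
    heavy_idx = np.argsort(attn_mass[:seq_len - recent])[-(budget - recent):]
    keep = np.sort(np.concatenate([heavy_idx, recent_idx]))
    return keys[keep], values[keep], keep

# Toy usage: a 4096-token cache squeezed into a 512-token budget.
L, H, D = 4096, 32, 128
k = np.random.randn(L, H, D).astype(np.float16)
v = np.random.randn(L, H, D).astype(np.float16)
k2, v2, kept = evict_kv(k, v, attn_mass=np.random.rand(L), budget=512)
print(k.nbytes // 2**20, "MiB ->", k2.nbytes // 2**20, "MiB per K tensor")  # 32 -> 4
```

A StreamingLLM-style policy is the degenerate case that keeps the first few positions plus the recency window and ignores the scores entirely, which is both why it is cheap and why it is blind to mid-context importance.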
Family two is quantization: keep all tokens, lower the bits per token. KIVI is plug-and-play 2-bit KV quantization with no fine-tuning, quantizing keys per-channel and values per-token; it reports 2.6x peak memory reduction, 4x larger batches, and 2.35-3.47x throughput gains (a toy sketch of the per-channel/per-token layout follows after the family rundown). KVQuant adds calibrated mixed precision (per-channel key quantization, pre-RoPE quantization, dense-and-sparse decomposition) and pushes to sub-4-bit precision for contexts up to 10 million tokens. TurboQuant, Google's recent method, uses random orthogonal rotation (PolarQuant) plus a 1-bit Quantized Johnson-Lindenstrauss correction, claiming 6-8x memory reduction at 3-bit with no offline calibration step.

Family three is architectural: Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) reduce KV cache size by design, since multiple query heads share fewer key/value heads. GQA is now the de facto default in Llama 3, Mistral, and most open-weight models. DeepSeek's Multi-head Latent Attention (MLA) goes further: it projects keys and values into a compressed latent vector during inference and reports 93.3% KV cache reduction in DeepSeek-V2 with no quality loss.

Family four is low-rank weight decomposition (Palu, LoRC): group-head low-rank projection of the KV weights, an approach orthogonal to both quantization and eviction, meaning it can stack with the other families.
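Here is the family-two layout KIVI describes, as a toy numpy round-trip (uniform quantization, my own helper names, no grouping or full-precision residual window, unlike the real kernels): keys get quantization statistics per channel, values per token.

```python
import numpy as np

def quantize(x, n_bits, axis):
    """Uniform asymmetric quantization along `axis`; returns codes + scale + zero point."""
    lo, hi = x.min(axis=axis, keepdims=True), x.max(axis=axis, keepdims=True)
    scale = np.where(hi == lo, 1.0, (hi - lo) / (2**n_bits - 1))
    return np.round((x - lo) / scale).astype(np.uint8), scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

seq_len, head_dim = 1024, 128
K = np.random.randn(seq_len, head_dim).astype(np.float32)   # toy data, one head
V = np.random.randn(seq_len, head_dim).astype(np.float32)

# Keys: outliers cluster in channels, so statistics are computed over the token
# axis (per-channel parameters). Values: statistics over the channel axis (per-token).
k_codes, k_scale, k_zero = quantize(K, n_bits=2, axis=0)
v_codes, v_scale, v_zero = quantize(V, n_bits=2, axis=1)

err_k = np.abs(K - dequantize(k_codes, k_scale, k_zero)).mean()
err_v = np.abs(V - dequantize(v_codes, v_scale, v_zero)).mean()
print(f"mean abs reconstruction error  K: {err_k:.3f}  V: {err_v:.3f}")
```

The sketch stores the 2-bit codes in uint8 for readability; real implementations pack four codes per byte and fuse dequantization into the attention kernel.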
For builders, three takeaways. First, the right technique depends on which phase bottlenecks you. If prefill latency is the constraint (very long prompts), SnapKV and Pyramid-class methods help; if decode throughput is the constraint (long generations, lots of concurrent users), H2O, KIVI, and StreamingLLM dominate. If you're training a new model from scratch, the architectural fix (GQA/MLA) is the first lever: it's free at inference time and stacks with everything else. Second, watch which inference stacks integrate which techniques: vLLM, TensorRT-LLM, SGLang, llama.cpp, and TGI each support different sets, and the gap between "research paper claims X" and "production library ships X with kernels that work on your GPU" is real. Third, the inversion (KV cache > model weights) is the architectural reason recent frontier model releases have shipped with attention modifications baked in (Llama 3's GQA, DeepSeek-V2/V3's MLA, Qwen3's hybrid GDN-plus-attention). The "open weights" you download now encode implicit KV-cache-compression bets; if you're benchmarking models against each other, comparing inference cost means measuring KV cache footprint at your specific batch size and sequence length, not just parameter count. The builder lesson: when memory is the bottleneck, model weight count is no longer the right unit of comparison.
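As a closing illustration of that last point, the same back-of-the-envelope formula from the first sketch separates two hypothetical models of the same parameter class purely by KV head count, a difference that is invisible in a weights-only comparison:

```python
def kv_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Same formula as the earlier sketch, returned in GB.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

batch, seq = 64, 8192
mha = kv_gb(32, 32, 128, seq, batch)   # hypothetical 7B-class model, full multi-head attention
gqa = kv_gb(32, 8, 128, seq, batch)    # same size class, GQA with 8 shared KV heads
print(f"batch={batch}, seq={seq}:  MHA {mha:.0f} GB  vs  GQA {gqa:.0f} GB")  # ~275 vs ~69
```

An MLA-style cache, which stores one compressed latent per token instead of per-head keys and values, shrinks the same bill further still, which is what the 93.3% figure DeepSeek reports for V2 is measuring.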
