The KV cache is the memory elephant in long-context LLM serving. Every token produces key and value tensors that stay resident for the duration of the sequence, and at long context on large models the cache routinely consumes 20 to 30 percent of total VRAM. Existing workarounds (grouped-query attention, PagedAttention, INT4/INT8 quantization) help but plateau. TurboQuant, published this week as arXiv:2504.19874 out of Google, claims roughly 4.5 to 5x compression versus FP16 baselines with near-zero accuracy loss, which, if it holds in production, would make it the most aggressive usable KV cache compression to date.

The trick is a two-stage pipeline. Stage one randomly rotates input vectors, which concentrates coordinate values into a Beta distribution and lets you apply an optimal scalar quantizer per coordinate. Stage two applies an MSE quantizer followed by a 1-bit Quantized Johnson-Lindenstrauss transform on the residual. The per-token storage comes down to quantization indices, sign bits, and an L2-norm scalar. At 3.5 bits per channel the paper reports "absolute quality neutrality," meaning the accuracy loss is statistically zero on their evaluation. At 2.5 bits per channel it reports "marginal quality degradation." The rotation step is the architectural insight: you pay a small compute cost to make the coordinate distribution quantization-friendly, then per-coordinate scalar quantization does the compression instead of the traditional per-group or per-tensor approaches.
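To make the two stages concrete, here is a toy NumPy sketch. It is not the paper's implementation: the rotation is a plain QR-based orthogonal matrix rather than a fast structured transform, the scalar quantizer is uniform rather than the paper's distribution-optimal one, and the residual step keeps only sign bits plus the residual's L2 norm, mirroring the per-token storage described above. All function names are invented for illustration.

```python
import numpy as np

def random_rotation(d, rng):
    # Random orthogonal matrix via QR of a Gaussian matrix (stand-in for
    # the fast structured rotations a real implementation would use)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize(x, bits):
    # Uniform symmetric per-coordinate quantizer (stand-in for the paper's
    # optimal scalar quantizer under the rotated Beta coordinate distribution)
    half = 2 ** bits / 2
    scale = np.max(np.abs(x)) + 1e-12
    return np.clip(np.round(x / scale * (half - 0.5)), -half, half - 1), scale

def dequantize(q, scale, bits):
    half = 2 ** bits / 2
    return q / (half - 0.5) * scale

def compress(v, R, bits=3):
    x = R @ v                                # stage 1: random rotation
    q, scale = quantize(x, bits)             # stage 2a: scalar quantization
    resid = x - dequantize(q, scale, bits)   # stage 2b: 1-bit residual:
    return q, scale, np.sign(resid), np.linalg.norm(resid)  # signs + L2 norm

def decompress(q, scale, signs, norm, R, bits=3):
    # Spread the stored norm evenly across coordinates, oriented by sign bits
    resid_hat = signs * norm / np.sqrt(len(signs))
    return R.T @ (dequantize(q, scale, bits) + resid_hat)   # undo rotation

rng = np.random.default_rng(0)
d = 64
R = random_rotation(d, rng)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
v_hat = decompress(*compress(v, R), R)
print(np.linalg.norm(v - v_hat))  # small relative to ||v|| = 1
```

The rotation is where the leverage comes from: after it, every coordinate looks statistically alike, so one scalar quantizer serves all channels without per-group scale bookkeeping.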

For anyone serving LLMs with long context, the math is direct. If your current stack caches FP16 KV and you are VRAM-bound (the common case in single-node serving at 32k+ context), 4.5 to 5x compression translates to roughly 4.5 to 5x the concurrent requests at the same KV-cache memory budget, or the same multiple on context length per request. The caveat is that the abstract does not enumerate which models were tested, so before rolling TurboQuant into production, verify that the evaluation covers your model family and sequence lengths. The paper also targets nearest-neighbor search as a second application, which suggests the rotation-plus-quantization pattern generalizes beyond attention caches.
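As a back-of-envelope check, the model configuration below is hypothetical (a generic 70B-class GQA layout, not one the paper reports on), and the ratio ignores the small per-token overhead of the stored norm scalar:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V tensors each hold layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_cache_bytes(80, 8, 128, 32_768, 2)   # FP16 = 2 bytes per element
ratio = 16 / 3.5                               # ~4.57x at 3.5 bits per channel
print(f"FP16 KV cache per 32k request: {fp16 / 2**30:.1f} GiB")          # 10.0 GiB
print(f"At 3.5 bits per channel:       {fp16 / ratio / 2**30:.1f} GiB")  # 2.2 GiB
```

Roughly 7.8 GiB per request freed up under these assumptions, which is where the "5x the concurrent requests" framing comes from when the KV cache is the binding constraint.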

The practical path for a production serving team is to watch for reference implementations and benchmark against what you already run. TurboQuant slots into the same place in your inference stack where KIVI, KVQuant, or Atom would go, so the integration cost is similar. If you have already quantized KV cache, compare 3.5 bits per channel at zero quality loss against your current setup; that is a competitive floor for 2026. If you have not quantized yet, this paper is the best current argument for starting now. The broader trend is that KV cache compression is no longer an optional optimization. At long-context workloads it is the gating constraint, and frontier-lab research is rapidly converging on sub-4-bit schemes that preserve accuracy.