Together AI OSCAR: 2-bit KV cache 8x smaller, NIAH-128K 45% vs prior 0%

Together AI open-sourced OSCAR this week: a 2-bit KV cache quantization system that finally makes 2-bit usable for long-context serving. The acronym expands to Offline Spectral Covariance-Aware Rotation, and the key technical move is deriving the rotation matrices from empirical *attention* statistics rather than from raw activation distributions. Earlier 2-bit baselines either skip rotation entirely (naive INT2) or apply a generic Hadamard rotation (QuaRot-INT2), and neither accounts for what attention actually computes. OSCAR uses the query covariance `C_Q` for the key path (because attention-logit error depends on `tr((K − K̂) Qᵀ Q (K − K̂)ᵀ)`, not on plain reconstruction error) and the score-weighted value covariance `C_S` for the value path. The composite key rotation `R_K = U_Q · H_Had · P_br` combines the query eigenvectors, a Hadamard matrix and a bit-reversal permutation, engineered so that quantization error lands in the directions that matter least to the attention logits.
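To make the key-path idea concrete, here is a minimal NumPy sketch of the composition described above (query eigenvectors, then Hadamard, then bit-reversal permutation). It is an illustration under stated assumptions, not OSCAR's implementation: the function names, the row-vector convention (`K @ R_K`), the synthetic calibration data and the generic per-row 2-bit quantizer are all hypothetical stand-ins.

```python
import numpy as np

def hadamard(d):
    """Sylvester Hadamard matrix scaled to be orthonormal (d must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def bit_reversal_permutation(d):
    """Permutation matrix whose row i selects the bit-reversed index of i."""
    bits = d.bit_length() - 1
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)]
    return np.eye(d)[idx]

def key_rotation(Q_calib):
    """Compose U_Q · H_Had · P_br from calibration queries (one head, shape [n, d])."""
    d = Q_calib.shape[1]
    C_Q = Q_calib.T @ Q_calib / Q_calib.shape[0]   # empirical query covariance
    _, U_Q = np.linalg.eigh(C_Q)                   # eigenvectors as columns
    return U_Q @ hadamard(d) @ bit_reversal_permutation(d)

def quantize_2bit(X):
    """Generic per-row asymmetric 4-level quantizer (a stand-in, not OSCAR's)."""
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 3.0, 1e-12)
    codes = np.clip(np.round((X - lo) / scale), 0, 3)
    return codes * scale + lo

def logit_error(K, K_hat, Q):
    """Attention-logit error ||(K - K_hat) Qᵀ||_F², equal to the trace form in the text."""
    return np.linalg.norm((K - K_hat) @ Q.T) ** 2

# Synthetic data: anisotropic queries, so some directions matter more than others.
rng = np.random.default_rng(0)
d = 64
Q = rng.standard_normal((512, d)) * np.linspace(0.1, 3.0, d)
K = rng.standard_normal((256, d))

R_K = key_rotation(Q)

# Rotating queries and keys by the same orthogonal matrix leaves logits unchanged.
logits = Q @ K.T
logits_rot = (Q @ R_K) @ (K @ R_K).T

# Quantize in the rotated basis, rotate back, and measure the logit error.
K_hat = quantize_2bit(K @ R_K) @ R_K.T
trace_form = np.trace((K - K_hat) @ Q.T @ Q @ (K - K_hat).T)
print("logit error:", logit_error(K, K_hat, Q))
```

Because `R_K` is orthogonal, the rotation itself is lossless; only the quantization step introduces error, and the choice of basis decides where that error lands. The sketch makes no claim about how much error the rotation removes on real models.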

The numbers earn the release. KV cache memory is cut by ~8×. Decode speedup is 1.84-3.08× at 100K context for single requests, and job-level throughput improves by up to 7.83× at batch size 32. Accuracy gap to BF16, averaged across AIME25, GPQA-Diamond, HumanEval, LiveCodeBench and MATH500: Qwen3-4B-Thinking −3.78 points, Qwen3-8B −1.42, Qwen3-32B **−0.02**, GLM-4.7-FP8 (358B) **+0.27**. The pattern is the right one: the accuracy gap closes as models scale up, which is what you want from a production-grade quantizer. Long context is where this matters most: on RULER-NIAH with Qwen3-8B at 128K context, OSCAR hits **45.0%** vs QuaRot-INT2's **0.0%**. The prior 2-bit baseline fails needle-in-a-haystack retrieval entirely at that length; OSCAR does not. Tested at 16K/32K/64K/128K with generation up to 32K tokens. Models: Qwen3-4B-Thinking, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8 (358B), MiniMax-M2.7.
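The ~8× figure follows from the bit widths alone: 16 bits per BF16 value against 2 bits per quantized value. A back-of-the-envelope sketch is below; the model shape (`num_layers`, `num_kv_heads`, `head_dim`) is a hypothetical configuration chosen for illustration, not the published shape of any model named above, and the calculation ignores quantization metadata such as scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2x: keys + values
    return values * bits_per_value / 8

# Hypothetical model shape, for illustration only.
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128)
seq_len = 128 * 1024

bf16 = kv_cache_bytes(**cfg, seq_len=seq_len, bits_per_value=16)
int2 = kv_cache_bytes(**cfg, seq_len=seq_len, bits_per_value=2)

print(f"BF16: {bf16 / 2**30:.2f} GiB")
print(f"INT2: {int2 / 2**30:.2f} GiB")
print(f"ratio: {bf16 / int2:.1f}x")
```

The ratio is independent of the model shape; the shape only sets the absolute sizes. Real deployments land slightly under the ideal 8× once per-group metadata and any tokens kept in BF16 are counted.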

System integration: OSCAR ships built into SGLang with full paged KV-cache and prefix-cache compatibility. The mixed-precision layout keeps BF16 for the attention sink (first 64 tokens) and the recent window (last 256 tokens), with INT2 for the history in between. Fused Triton kernels handle rotation, clipping and quantization on write, and dequantization and inverse rotation on read. The value rotation gets absorbed into the projection weights offline, so there's zero runtime cost for that half of the system. Pre-computed rotations live at ModelScope RotationZoo, so most builders can clone and serve without running the calibration pass themselves. Repository: github.com/FutureMLS-Lab/OSCAR. The article doesn't state the license explicitly, so builders should check before commercial use.
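The mixed-precision layout can be pictured as three token ranges. The sketch below partitions a sequence using the 64-token sink and 256-token recent window quoted above, then computes the average bits per cached value. The function names and the handling of short sequences (everything stays BF16 when the two windows cover the whole sequence) are assumptions for illustration, not the SGLang integration's actual behavior.

```python
SINK_TOKENS = 64      # first tokens kept in BF16 (from the article)
RECENT_TOKENS = 256   # last tokens kept in BF16 (from the article)
BITS = {"bf16": 16, "int2": 2}

def precision_layout(seq_len, sink=SINK_TOKENS, recent=RECENT_TOKENS):
    """Return half-open (start, end, dtype) token ranges for one sequence."""
    if seq_len <= sink + recent:
        # Assumed behavior: nothing is old enough to quantize yet.
        return [(0, seq_len, "bf16")]
    return [
        (0, sink, "bf16"),                     # attention sink
        (sink, seq_len - recent, "int2"),      # quantized history
        (seq_len - recent, seq_len, "bf16"),   # recent window
    ]

def average_bits(seq_len):
    """Average bits per cached value, ignoring quantization metadata."""
    total = sum((end - start) * BITS[dtype]
                for start, end, dtype in precision_layout(seq_len))
    return total / seq_len

for n in (1024, 16 * 1024, 128 * 1024):
    avg = average_bits(n)
    print(f"{n:>7} tokens: {avg:.3f} bits/value, {16 / avg:.2f}x smaller than BF16")
```

The BF16 windows are a fixed 320 tokens, so their share of the cache shrinks as context grows, and the effective compression approaches the ideal 8× at long context.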

Monday morning: if you're serving long-context Qwen3, GLM-4.7 or MiniMax-M2 in production and running into KV-cache memory ceilings, OSCAR is a drop-in test for SGLang deployments. The 8× memory reduction at near-zero accuracy cost on 32B+ models is the right unit economics under cost pressure at scale (the same cost pressure that drove Microsoft to swap Claude Code for Copilot CLI earlier this week). Honest limitations: per-layer calibration is required (not a single universal rotation), the BF16 sink buffer is load-bearing (Table 5 shows accuracy degrades sharply without it), the Triton kernel path means vLLM and TensorRT-LLM integration isn't there yet, and the article doesn't disclose what license the code ships under. For builders on vLLM, this is the architectural primitive to port: the attention-aware rotation idea is reproducible from the paper independent of the SGLang implementation.
