EAGLE 3.1 fixes speculative decoding attention drift, merged into vLLM 0.22

The EAGLE team, vLLM, and TorchSpec jointly released EAGLE 3.1, which fixes a real production bug in speculative decoding: as speculation depth increases, the drafter model shifts attention away from sink tokens and toward its own generated tokens, degrading acceptance length and output stability. The fix consists of two architectural changes. First, FC normalization is applied after each target hidden state and before the FC layer, which bounds hidden-state magnitudes. Second, post-norm hidden-state feedback makes the drafter behave like recursive invocation rather than appended layers. The change is already merged into vLLM main, ships in v0.22.0, and is backward compatible with existing EAGLE 3 checkpoints.
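
To make the two changes concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the EAGLE 3.1 or vLLM implementation: the function names, the dimensions, the use of an RMS-style norm without a learned gain, and the residual update standing in for a drafter pass are all hypothetical. It shows only the two properties the release describes: normalizing each target hidden state before the FC layer makes the fused input insensitive to hidden-state magnitude, and feeding back a normalized state keeps the drafter's input bounded at every speculation depth.

```python
import math
import random


def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, eps=1e-6):
    """Scale a vector to roughly unit RMS (no learned gain, for brevity)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def linear(weight, x):
    """Matrix-vector product; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_target_states(states, fc_weight, normalize=True):
    """Concatenate target hidden states and project them with the FC layer.

    With normalize=True, each state is normalized on its own before the
    concatenation, so no single state can dominate through sheer magnitude.
    """
    if normalize:
        states = [rms_norm(s) for s in states]
    concat = [v for s in states for v in s]
    return linear(fc_weight, concat)


def draft_rollout(h0, step_weight, depth, post_norm=True):
    """Toy drafter loop: a residual update h + W h stands in for one pass.

    With post_norm=True, the state fed to the next step is normalized, so
    every step sees an input of the same scale (recursive invocation).
    Returns the RMS magnitude of the fed-back state at each depth.
    """
    h, magnitudes = h0, []
    for _ in range(depth):
        update = linear(step_weight, h)
        h = [a + b for a, b in zip(h, update)]
        if post_norm:
            h = rms_norm(h)
        magnitudes.append(rms(h))
    return magnitudes


random.seed(0)
DIM = 8  # hypothetical hidden size
fc_weight = [[random.gauss(0, 0.1) for _ in range(3 * DIM)] for _ in range(DIM)]
step_weight = [[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]
# Three hypothetical target-layer states with very different magnitudes.
states = [[random.gauss(0, s) for _ in range(DIM)] for s in (1.0, 10.0, 100.0)]

fused = fuse_target_states(states, fc_weight)
fused_scaled = fuse_target_states([[1000 * v for v in s] for s in states], fc_weight)
magnitudes = draft_rollout(fused, step_weight, depth=6)
```

In this toy, `fused` and `fused_scaled` agree up to the small `eps` term even though the second input is 1000 times larger, and every entry of `magnitudes` stays close to 1 regardless of depth.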

The reported gains are concrete. On long-context workloads, acceptance length is up to 2× longer than with EAGLE 3. On Kimi K2.6-NVFP4 SPEED-Bench coding, per-user throughput lifts are 2.03× at concurrency 1, 1.71× at concurrency 4, and 1.66× at concurrency 16. The pattern, with the biggest lift at low concurrency and a narrowing gain as concurrency rises, is what builders should expect from any speculative decoding gain: speculative decoding wins when each request is bottlenecked on memory bandwidth, which is the regime at low concurrency, because verifying several draft tokens in one pass costs little more than decoding one. At high concurrency the batch already keeps the hardware busy, you are bottlenecked on aggregate throughput, and the speculative win is smaller. The release shows no head-to-heads against Medusa or vanilla draft-model baselines, which is the methodology gap to flag.
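
A back-of-the-envelope sketch shows why drift at depth costs acceptance length. It assumes a single chain of draft tokens rather than a draft tree, and the acceptance probabilities are illustrative numbers, not figures from the release. If draft token k is accepted with probability a_k given that all earlier drafts were accepted, the expected number of tokens emitted per draft-verify cycle is 1 + a_1 + a_1·a_2 + ..., where the leading 1 is the token the target model always contributes.

```python
def expected_acceptance_length(accept_probs):
    """Expected tokens emitted per draft-verify cycle for a chain of drafts.

    accept_probs[i] is the probability that draft token i+1 is accepted,
    given that every earlier draft token was accepted. The initial 1.0 is
    the token the target model always contributes (bonus or correction).
    """
    expected = 1.0
    survive = 1.0  # probability that all drafts so far were accepted
    for p in accept_probs:
        survive *= p
        expected += survive
    return expected


DEPTH = 7  # hypothetical number of draft tokens per cycle

# Stable drafter: the same acceptance probability at every depth.
flat = [0.8] * DEPTH
# Drifting drafter: acceptance decays as speculation gets deeper.
drifting = [0.8 * (0.9 ** i) for i in range(DEPTH)]

flat_length = expected_acceptance_length(flat)
drifting_length = expected_acceptance_length(drifting)

print(f"flat acceptance:     {flat_length:.2f} tokens per cycle")
print(f"drifting acceptance: {drifting_length:.2f} tokens per cycle")
```

With a constant acceptance probability α and γ draft tokens the sum reduces to the closed form (1 − α^(γ+1)) / (1 − α). Because each term is a product of all earlier probabilities, a drafter whose acceptance decays with depth loses most of the benefit of speculating further.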

The ecosystem read sits in the integration path more than in the numbers. EAGLE has been the production speculative decoding family for two years; vLLM is the default inference engine for self-hosted LLMs; TorchSpec provides the training side. When the three converge on a release that fixes a known instability with a backward-compatible algorithmic change, that is the inference stack reducing its load-bearing variance, not adding a feature. The open-sourced draft model for Kimi K2.6 on HuggingFace means builders on Kimi already have the artifact; for other base models, the training-side work is on TorchSpec. Agentic loops with growing context windows are where attention drift hurt the most (long agent traces, code completion in long files, document QA), and those are exactly the workloads where 2× acceptance length translates to user-visible latency wins.

If you run vLLM in production: schedule the 0.22.0 upgrade and re-train your draft models on TorchSpec when you can. If you build inference SaaS: this is the change that quietly improves the long-context cost curve for everyone using your stack.
