Google released Multi-Token Prediction (MTP) drafters for Gemma 4 today: pre-trained lightweight drafter models that pair with the target Gemma to do speculative decoding out of the box. Headline claim: up to 3x faster inference with token-by-token identical output to the target model. The drafter proposes a sequence of future tokens; the target verifies them in parallel. When verification rejects a draft token, generation falls back to the target's actual prediction at that position, so quality is preserved bit-exactly. The architectural detail that matters: the drafters share the target's KV cache and activations, which sidesteps the standard speculative-decoding overhead of running two independent models with separate cache states. Edge variants (E2B, E4B) get an "efficient clustering technique in the embedder layer" to address the logit-calculation bottleneck that dominates small-model inference. Apache 2.0, weights on Hugging Face and Kaggle.
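To make the propose-then-verify loop concrete, here is a minimal sketch of one greedy speculative-decoding round. The `draft_model` and `target_model` callables (token prefix in, next-token logits out) are illustrative placeholders, not the release's API, and real implementations batch the verification into a single target forward pass rather than looping:

```python
import numpy as np

def speculative_step(prefix, draft_model, target_model, k=4):
    """One round: draft k tokens cheaply, then verify against the target."""
    # 1) Drafter proposes k tokens autoregressively.
    draft = list(prefix)
    for _ in range(k):
        draft.append(int(np.argmax(draft_model(draft))))
    proposed = draft[len(prefix):]

    # 2) Target checks each position (in practice, one parallel forward pass).
    accepted = []
    for tok in proposed:
        target_tok = int(np.argmax(target_model(list(prefix) + accepted)))
        if target_tok == tok:
            accepted.append(tok)         # draft agrees with target: keep it
        else:
            accepted.append(target_tok)  # mismatch: take the target's token
            break                        # and discard the rest of the draft
    # Real stacks also emit one bonus target token when all k drafts pass.
    return accepted  # always >= 1 token, identical to target-only decoding
```

The fallback-on-mismatch step is why the output is bit-identical to running the target alone: every emitted token is either one the target would have produced or the target's own correction.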
Speculative decoding has been the hot inference optimization for two years, but in practice builders have had to either train their own drafters (significant work) or use generic small-model drafters that don't capture the target's distribution well (mediocre acceptance rates). Google shipping pre-trained drafters specifically tuned for Gemma 4 closes that gap: a drop-in 3x speedup with no training cost on the builder side. The KV-cache sharing is the architecturally meaningful choice: standard speculative-decoding implementations like vLLM's pair an arbitrary draft model with the target and pay duplicated cache costs. Sharing KV state means a lower memory footprint and faster verification rounds. A comparison to EAGLE (which uses the target's hidden states for drafting) and Medusa (which adds prediction heads to the target) isn't disclosed in the launch coverage; from the description, MTP drafters look closer to EAGLE in spirit, but with separate lightweight drafter weights rather than additional target heads.
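Some back-of-envelope arithmetic shows what the cache sharing buys. Every dimension below is an illustrative placeholder, not a published Gemma 4 or drafter spec; the point is the shape of the saving, not the exact numbers:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values; bf16/fp16 assumed via dtype_bytes=2
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical dimensions for a large target and a shallow drafter.
target  = kv_cache_bytes(layers=46, kv_heads=8, head_dim=256, seq_len=8192, batch=8)
drafter = kv_cache_bytes(layers=4,  kv_heads=8, head_dim=256, seq_len=8192, batch=8)

print(f"target cache alone:     {target / 2**30:.1f} GiB")
print(f"+ independent drafter:  {(target + drafter) / 2**30:.1f} GiB")
print(f"shared-KV drafter:      {target / 2**30:.1f} GiB (drafter adds ~nothing)")
```

The drafter still adds its own weights and compute, but it stops paying a second per-sequence cache that scales with context length and batch size, which is exactly the term that hurts at production batch sizes.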
The ecosystem read: speculative decoding is becoming a baseline expectation for production inference on open-weight models, and labs that ship pre-trained drafters alongside their main checkpoints lower the barrier meaningfully. DeepSeek V3 shipped MTP heads built into the model. Mistral Medium 3.5's coding tier sits adjacent to this, though the drafter approach there isn't disclosed. Google making the drafters separate-but-cache-sharing modules is the design choice that lets builders pull just the drafter for their existing Gemma 4 deployment rather than reload a unified MTP-enabled checkpoint. For builders running self-hosted Gemma 4 in production, the upgrade path is: download the matching MTP drafter, plug it into your inference framework if it supports KV-shared speculative decoding (vLLM and TensorRT-LLM both do, with config), and measure acceptance rate on your traffic. Acceptance rate determines actual speedup: 3x is the optimistic case, and real-world gains are workload-dependent.
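To turn a measured acceptance rate into a speedup ceiling, the standard expected-tokens formula from Leviathan et al. (2023) is a quick sanity check. This sketch assumes i.i.d. per-token acceptance and ignores drafter latency, so treat the output as an upper bound:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """alpha: per-token acceptance rate; k: draft tokens per round."""
    if alpha >= 1.0:
        return k + 1.0
    # Geometric-series form of E[tokens accepted per verification round].
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.8, 0.9):
    rate = expected_tokens_per_pass(alpha, k=4)
    print(f"acceptance {alpha:.0%}: ~{rate:.2f} tokens per target pass")
```

At 80% acceptance with 4-token drafts that works out to roughly 3.4 tokens per target pass, which is about where a 3x headline figure lives once drafter overhead is subtracted.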
Practical move: if you're running Gemma 4 in production for chat, code completion, or low-latency inference, this is the optimization to test this week. Pull the MTP drafter, swap it into your inference stack, and measure latency and acceptance rate on your actual prompts. The "no quality loss" claim is verifiable token by token by comparing outputs against the non-MTP target; run that diff on a sample of production requests as your sanity check. For edge deployment of Gemma 4 E2B/E4B, the embedder-layer clustering optimization specifically targets the logit-calculation bottleneck that limits small-model latency on mobile/edge silicon; that's the case where speculative decoding usually doesn't pay off, and Google's fix is the architectural detail to read carefully if you ship Gemma 4 on-device. The Apache 2.0 license keeps the commercial path open without negotiation friction. The next thing to watch is whether other open-weight labs follow with pre-trained drafter modules; once they're table stakes, the speculative-decoding-from-scratch tax disappears across the open ecosystem.
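A minimal harness for that token-by-token diff, assuming greedy (temperature-0) decoding so both stacks are deterministic. The two `generate_*` callables are placeholders for however your serving stack exposes generation; each should return a list of token IDs:

```python
def first_divergence(a: list[int], b: list[int]) -> int | None:
    """Index of the first differing token, or None if sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

def check_parity(prompts, generate_baseline, generate_mtp):
    mismatches = []
    for p in prompts:
        i = first_divergence(generate_baseline(p), generate_mtp(p))
        if i is not None:
            mismatches.append((p, i))  # record first divergent position
    print(f"{len(prompts) - len(mismatches)}/{len(prompts)} prompts token-identical")
    return mismatches
```

Any nonzero mismatch count under greedy decoding points at a misconfigured integration, since exact parity is the whole point of the verification step.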
