Speculative Decoding

Also known as: Assisted Generation, Draft-and-Verify
A speed optimization in which a small, fast "draft" model generates several candidate tokens, and the large target model then verifies them all in a single forward pass. When the draft model guesses correctly (which happens often for predictable tokens), several tokens are accepted at once, skipping the large model's slow token-by-token generation. When the draft is wrong, the large model corrects from that point.

Why It Matters

Speculative decoding can speed up LLM inference by 2–3x with no loss in output quality: the final output is mathematically identical to what the large model would have produced on its own. It is one of the rare free lunches in AI inference optimization, which is why it has been widely adopted by providers and frameworks.

Deep Dive

The key insight is that verifying a draft is much faster than generating from scratch. During normal autoregressive generation, each token requires a full serial forward pass through the model. But the model can process multiple tokens in parallel during a single forward pass (just as it does with your prompt). So if you have a draft of 5 tokens, the large model can check all 5 in roughly the time it would take to generate 1. If 4 out of 5 are correct, you've generated 4 tokens for the cost of 1+1 (draft generation + verification); in fact, the same verification pass also supplies the target model's own token at the rejected position, so even a mismatch advances generation by one token.
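
As a concrete illustration, here is a minimal sketch of one draft-and-verify step under greedy decoding. It assumes Hugging Face-style causal LMs whose forward call returns `.logits`; the function name, the draft length `k`, and the loop structure are illustrative rather than any particular library's API. (The lossless guarantee for sampled, non-greedy decoding requires a rejection-sampling acceptance rule instead of the exact argmax matching shown here.)

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=5):
    """One draft-and-verify step (greedy decoding). Returns extended input_ids.
    Sketch only: assumes HF-style models returning .logits, no KV caching."""
    n_prompt = input_ids.shape[1]

    # 1) Draft: the small model proposes k tokens, one cheap step at a time.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)

    # 2) Verify: the large model scores the prompt plus all k draft tokens
    #    in a single parallel forward pass (the expensive call happens once).
    target_logits = target_model(draft_ids).logits

    # 3) Accept the longest prefix where the target model's own greedy choice
    #    matches the draft; at the first mismatch, emit the target's token
    #    instead, so even a rejection yields one correct new token.
    out = input_ids
    for i in range(k):
        target_token = target_logits[:, n_prompt + i - 1, :].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, target_token], dim=-1)
        if not torch.equal(target_token, draft_ids[:, n_prompt + i : n_prompt + i + 1]):
            break  # first disagreement: discard the rest of the draft
    else:
        # All k accepted: the same pass also gives the target's next token for free.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, bonus], dim=-1)
    return out
```

Production implementations (e.g., vLLM, TensorRT-LLM, Hugging Face's assisted generation) repeat this step until end-of-sequence and reuse KV caches so that neither model recomputes the shared prefix.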

Choosing the Draft Model

The draft model should be much smaller and faster than the target model, but similar enough to agree on most tokens. A common approach is to use a smaller model from the same family (e.g., Llama 8B drafting for a Llama 70B target). Some systems use the target model's own early layers as the draft (self-speculative decoding). The acceptance rate (the fraction of draft tokens the target model agrees with) determines the speedup: typical acceptance rates of 70–85% yield 2–3x throughput improvements.
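
How the acceptance rate translates into speedup can be made concrete with the standard analysis from Leviathan et al. (2023): assuming each draft token is accepted independently with probability α, a draft of length γ yields an expected (1 − α^(γ+1)) / (1 − α) tokens per target forward pass. The sketch below (helper name is hypothetical) simply evaluates that formula for the acceptance rates quoted above.

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass, assuming each draft
    token is accepted i.i.d. with probability alpha (Leviathan et al., 2023).
    A rejection still yields one corrected token from the target model."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.70, 0.80, 0.85):
    e = expected_tokens_per_target_pass(alpha, gamma=5)
    print(f"acceptance {alpha:.0%}: ~{e:.1f} tokens per target pass")
# acceptance 70%: ~2.9 tokens per target pass
# acceptance 80%: ~3.7 tokens per target pass
# acceptance 85%: ~4.2 tokens per target pass
```

Wall-clock speedup is lower than the raw token count because the γ draft steps are not free; with a draft model roughly a tenth the cost of the target, these figures land in the quoted 2–3x range.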

When It Helps Most

Speculative decoding helps most when the text is predictable (boilerplate, code with common patterns, structured output) and least when every token is surprising (creative writing, complex reasoning). It also helps more when the bottleneck is latency rather than throughput: if you're serving many concurrent requests, the GPU is already saturated with batched work, leaving little spare parallel compute for verification to exploit.
