
Speculative Decoding

Assisted Generation, Draft-and-Verify
A speed optimization in which a small, fast "draft" model generates several candidate tokens and the large target model verifies them in a single forward pass. When the draft model guesses right (as it often does for predictable tokens), multiple tokens are accepted at once, skipping the large model's slow token-by-token generation. When the draft is wrong, the large model corrects from that point.

Why It Matters

Speculative decoding can speed up LLM inference by 2–3x with no loss in output quality: the final output is mathematically identical to what the large model would have produced on its own. It is one of the rare free lunches in AI inference optimization, which is why it has been widely adopted by vendors and frameworks.
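As one concrete example of that adoption, Hugging Face transformers ships this feature under the name "assisted generation": passing an `assistant_model` to `generate()` switches it to draft-and-verify. A minimal usage sketch follows; the Llama 3 checkpoint names are illustrative, and any draft model that shares the target's tokenizer should work.

```python
# Sketch: assisted generation (speculative decoding) in Hugging Face
# transformers. The checkpoint names are examples; the draft model must
# use the same tokenizer as the target model.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "meta-llama/Meta-Llama-3-70B"  # large target model
draft_name = "meta-llama/Meta-Llama-3-8B"    # small draft model

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_name, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(target.device)
# assistant_model switches generate() from plain autoregressive decoding
# to draft-and-verify; the output is unchanged, only faster.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```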

Deep Dive

The key insight is that verifying a draft is much faster than generating from scratch. During normal autoregressive generation, each token requires a full serial forward pass through the model. But the model can process multiple tokens in parallel during a single forward pass (like it does with your prompt). So if you have a draft of 5 tokens, the large model can check all 5 in roughly the time it would take to generate 1. If 4 out of 5 are correct, you've generated 4 tokens for the cost of 1+1 (draft generation + verification).
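The following is a minimal sketch of that draft-and-verify loop under greedy decoding. The `draft_model` and `target_model` callables are hypothetical stand-ins (each maps a token sequence to the next token id), and a real implementation would score all draft positions in one batched forward pass instead of the serial verify loop shown here.

```python
# Minimal greedy draft-and-verify sketch. Both "models" are hypothetical
# callables mapping a token sequence to the next token id. For simplicity
# the verify step is written serially, and the "bonus" token a real system
# gains when all drafts are accepted is omitted.
from typing import Callable, List

Model = Callable[[List[int]], int]  # token ids -> next token id

def speculative_decode(
    target_model: Model,
    draft_model: Model,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 5,
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. Draft: the small model proposes draft_len tokens serially (cheap).
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_model(tokens + draft))

        # 2. Verify: the target model checks each position. A real system
        #    scores all positions in ONE parallel forward pass; this loop
        #    only mimics the accept/reject logic.
        accepted = 0
        for i in range(draft_len):
            target_token = target_model(tokens + draft[:i])
            if target_token == draft[i]:
                accepted += 1
            else:
                # First mismatch: substitute the target's token and stop.
                draft[i] = target_token
                accepted += 1
                break

        tokens.extend(draft[:accepted])
        generated += accepted
    return tokens[: len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Toy demo: identical deterministic "models", so every draft is accepted.
    cycle = [1, 2, 3]
    def toy(seq: List[int]) -> int:
        return cycle[len(seq) % len(cycle)]
    print(speculative_decode(toy, toy, prompt=[0], max_new_tokens=8))
```

Because every accepted token equals the target model's own greedy choice, the output matches what the target model would have generated alone.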

Choosing the Draft Model

The draft model should be much smaller and faster than the target model, but similar enough to agree on most tokens. A common approach: use a model from the same family but smaller (Llama 70B verified by Llama 8B drafts). Some systems use the target model's own early layers as a draft model (self-speculative decoding). The acceptance rate — what fraction of draft tokens the target model agrees with — determines the speedup. Typical acceptance rates of 70–85% yield 2–3x throughput improvements.
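To see how the acceptance rate translates into speedup, assume (as a simplification) that each draft token is accepted independently with probability alpha. A draft-verify cycle of length gamma then yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation, counting the token the target model supplies at the first rejection. The relative draft cost c = 0.05 in the sketch below is an assumed figure, not from the text.

```python
# Back-of-the-envelope speedup estimate under an i.i.d. acceptance
# assumption. alpha = per-token acceptance rate, gamma = draft length,
# c = draft model's cost per token relative to one target forward pass.
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    # Geometric-series expectation, including the target's token at the
    # first rejection: (1 - alpha^(gamma+1)) / (1 - alpha).
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float = 0.05) -> float:
    # One cycle costs gamma draft steps plus one verification pass;
    # plain autoregressive decoding costs 1 unit per token.
    return expected_tokens_per_cycle(alpha, gamma) / (gamma * c + 1)

for alpha in (0.70, 0.80, 0.85):
    print(f"acceptance {alpha:.0%}: ~{speedup(alpha, gamma=5):.1f}x")
```

With gamma = 5 this gives roughly 2.4x at 70% acceptance and 3.3x at 85%, consistent with the 2–3x range quoted above.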

When It Helps Most

Speculative decoding helps most when the text is predictable (boilerplate, code with common patterns, structured output) and helps least when every token is surprising (creative writing, complex reasoning). It also helps more when the bottleneck is latency rather than throughput — if you're serving many concurrent requests, the GPU is already busy and the parallelism gains are smaller.
