
Distillation

Knowledge Distillation, Model Distillation
Training a smaller "student" model to mimic the behavior of a larger "teacher" model. Instead of training the student on raw data with hard labels (cat/dog), you train it on the teacher's soft probability distribution (70% cat, 20% dog, 10% fox). Soft outputs carry more information than hard labels because they encode the teacher's uncertainty and the relationships between classes.

Why It Matters

Distillation is how the industry makes powerful AI practical. A 70-billion-parameter model may be too large and too expensive for real-time applications, but a 7B model distilled from it can capture 90% of the capability at 10% of the cost. Many of the small, fast models people run locally were distilled from larger frontier models.

Deep Dive

The original insight from Hinton et al. (2015) was that a teacher's output probabilities contain "dark knowledge" — information about which wrong answers are almost right. A digit classifier that sees a "7" might output 0.8 for "7" but 0.15 for "1" and 0.03 for "9" — revealing that 7s look more like 1s than 9s. A student trained on these soft targets learns these relationships, which hard labels ("it's a 7, period") don't convey.
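In code, the classic recipe combines a soft-target loss (KL divergence between temperature-softened teacher and student distributions) with ordinary cross-entropy on the hard labels. A minimal PyTorch sketch, where the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from the source:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soften both distributions with temperature T so the teacher's
    # "dark knowledge" (small probabilities on near-miss classes) is visible.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures
    # (as suggested in Hinton et al., 2015).
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

Raising T flattens the teacher's distribution, which amplifies the relative weight of the near-miss classes the paragraph above describes; at T=1 the soft targets collapse toward the teacher's raw prediction.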

In the LLM Era

For LLMs, distillation takes several forms. The most common is training a smaller model on outputs generated by a larger model — you run the teacher on a large set of prompts, collect its responses, and fine-tune the student on those (prompt, response) pairs. This is sometimes called "distillation through generation." It's controversial because some model licenses prohibit using outputs to train competing models, and because it can create models that sound confident but lack the teacher's deeper reasoning abilities.
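As a concrete illustration of distillation through generation, the sketch below collects teacher responses to build the (prompt, response) dataset, assuming a Hugging Face-style causal LM. The checkpoint name is a placeholder, and the student fine-tuning step is left to any standard supervised trainer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "big-teacher-model"  # hypothetical checkpoint name
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

prompts = ["Explain photosynthesis simply.", "Write a haiku about rain."]
pairs = []
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    out = teacher.generate(ids, max_new_tokens=128)
    # Keep only the continuation, dropping the echoed prompt tokens.
    response = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    pairs.append({"prompt": p, "response": response})

# The student is then fine-tuned on these (prompt, response) pairs with
# ordinary supervised next-token prediction.
```

Note that this trains the student on the teacher's sampled text rather than its full probability distributions, which is one reason the resulting models can imitate surface style without inheriting the teacher's deeper reasoning.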

Distillation vs. Quantization

People sometimes confuse distillation with quantization. Quantization shrinks a model by reducing numerical precision (32-bit to 4-bit) — same model, smaller numbers. Distillation creates an entirely new, architecturally smaller model — fewer layers, smaller dimensions — that has learned from the teacher. They're complementary: you can distill a 70B model into a 7B model and then quantize the 7B model to make it even smaller.
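To make the complementarity concrete: quantization is a load-time or post-training transformation of an existing checkpoint, so it can be applied directly to a distilled student. A minimal sketch loading a hypothetical distilled 7B model in 4-bit precision via bitsandbytes:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same architecture the distillation produced, just stored and computed
# with 4-bit weights instead of full precision.
student = AutoModelForCausalLM.from_pretrained(
    "distilled-7b-student",  # hypothetical distilled checkpoint
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```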
