
Pruning

Model Pruning, Weight Pruning
Removing unnecessary parameters (weights, neurons, or entire layers) from a trained model to make it smaller and faster, without a significant loss in quality. Like pruning a tree: cut away the branches that contribute least, and the tree stays healthy. Structured pruning removes entire neurons or attention heads; unstructured pruning zeroes out individual weights.

Why It Matters

Pruning is a model-compression technique alongside quantization and distillation. The key insight is that most neural networks are over-parameterized: many weights contribute very little to the output. The "lottery ticket hypothesis" says that inside a large network there exists a much smaller subnetwork that can match the full network's performance. Pruning finds and keeps that subnetwork.

Deep Dive

Unstructured Pruning

Unstructured pruning sets individual weights to zero based on magnitude (smallest weights contribute least). This creates sparse weight matrices. The challenge: standard hardware doesn't efficiently handle sparse computations, so a model that's 50% pruned doesn't run 2x faster on a GPU — the speedup requires specialized sparse computation libraries or hardware. This limits unstructured pruning's practical benefit.
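A minimal sketch of magnitude-based unstructured pruning, using numpy. The function name `magnitude_prune` and the example matrix are illustrative, not from any particular library:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest magnitude; everything at or below it is pruned.
    threshold = np.partition(flat, k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

W = np.array([[0.9, -0.05, 0.3],
              [0.01, -0.7, 0.02]])
W_pruned = magnitude_prune(W, sparsity=0.5)
# Half the entries are now exactly zero, but the matrix keeps its
# original shape, so standard dense kernels see no speedup.
```

Note that the pruned matrix is the same size as the original; only specialized sparse kernels can exploit the zeros, which is exactly the limitation described above.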

Structured Pruning

Structured pruning removes entire neurons, attention heads, or layers. This produces a smaller dense model that runs faster on standard hardware without needing sparse computation support. Research shows that many attention heads are redundant — removing 20–40% of heads in a Transformer often has minimal impact on performance. Some heads consistently contribute more than others, and the important heads can be identified through gradient-based importance scores.
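A sketch of structured pruning at the attention-head level, assuming head weights stacked as `(num_heads, head_dim, d_model)`. The importance score here is a simple L2-norm proxy; as noted above, gradient-based importance scores are what's typically used in practice:

```python
import numpy as np

def prune_heads(head_weights: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Drop whole attention heads with the lowest importance scores.

    Assumes shape (num_heads, head_dim, d_model). Importance is an
    L2-norm proxy here; real pipelines use gradient-based scores.
    """
    num_heads = head_weights.shape[0]
    keep = max(1, int(round(keep_ratio * num_heads)))
    scores = np.linalg.norm(head_weights.reshape(num_heads, -1), axis=1)
    kept = np.sort(np.argsort(scores)[-keep:])  # top heads, original order
    return head_weights[kept]

rng = np.random.default_rng(0)
heads = rng.normal(size=(8, 16, 64))
pruned = prune_heads(heads, keep_ratio=0.75)  # remove 25% of heads
# The result is a smaller *dense* tensor of shape (6, 16, 64), so it
# runs faster on standard hardware with no sparse-kernel support needed.
```

The key contrast with unstructured pruning: the output tensor is genuinely smaller, not merely sparse, which is why the speedup materializes on ordinary GPUs.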

Pruning + Quantization + Distillation

The three compression techniques compose well: prune redundant parameters, quantize the remaining weights to lower precision, and optionally distill from the original model to recover any quality loss. This pipeline can reduce a model to 10–20% of its original size while retaining 95%+ of its capability. The order matters: typically prune first, then quantize the pruned model, then fine-tune to recover quality.
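The prune-then-quantize ordering can be sketched as follows, with numpy standing in for a real framework. The helper names and the symmetric per-tensor int8 scheme are illustrative assumptions; the fine-tuning/distillation step needs a training loop and is omitted:

```python
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    # Step 1: zero out the smallest-magnitude weights (unstructured sketch).
    k = int(sparsity * W.size)
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1] if k else -np.inf
    return W * (np.abs(W) > thresh)

def quantize_int8(W: np.ndarray):
    # Step 2: symmetric per-tensor int8 quantization of the surviving weights.
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

W = np.random.default_rng(1).normal(size=(64, 64))
W_pruned = magnitude_prune(W, sparsity=0.5)
q, scale = quantize_int8(W_pruned)
W_restored = q.astype(np.float32) * scale
# Pruned zeros quantize to 0 and survive the round trip; rounding error
# for the remaining weights is bounded by half the quantization step.
# Step 3 (fine-tune or distill to recover quality) is omitted here.
```

Doing pruning first matters in this sketch too: the quantization scale is set by the surviving weights, and the zeros introduced by pruning are represented exactly in int8.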
