
Pruning

Model Pruning, Weight Pruning
Pruning removes unnecessary parameters (weights, neurons, or entire layers) from a trained model to make it smaller and faster without significant quality loss. Like pruning a tree: cut away the branches that contribute least and the tree stays healthy. Structured pruning removes entire neurons or attention heads; unstructured pruning zeroes out individual weights.

Why It Matters

Pruning is a model compression technique alongside quantization and distillation. The key insight: most neural networks are over-parameterized, and many weights contribute little to the output. The "Lottery Ticket Hypothesis" posits that a large network contains a much smaller subnetwork that can match the full network's performance. Pruning finds and keeps that subnetwork.

Deep Dive

Unstructured pruning sets individual weights to zero based on magnitude (smallest weights contribute least). This creates sparse weight matrices. The challenge: standard hardware doesn't efficiently handle sparse computations, so a model that's 50% pruned doesn't run 2x faster on a GPU — the speedup requires specialized sparse computation libraries or hardware. This limits unstructured pruning's practical benefit.
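The magnitude criterion above can be sketched in a few lines. This is a minimal NumPy illustration, not from the source: the function name `magnitude_prune` and the 50% sparsity target are assumptions chosen for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity`
    fraction of the matrix is zero (unstructured pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value across the whole matrix.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # keep only weights above threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.5)
print(f"sparsity: {(pruned == 0).mean():.2f}")
```

Note that the result is the same shape as the input: the zeros save nothing on dense hardware, which is exactly the limitation described above.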

Structured Pruning

Structured pruning removes entire neurons, attention heads, or layers. This produces a smaller dense model that runs faster on standard hardware without needing sparse computation support. Research shows that many attention heads are redundant — removing 20–40% of heads in a Transformer often has minimal impact on performance. Some heads consistently contribute more than others, and the important heads can be identified through gradient-based importance scores.
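By contrast, structured pruning actually shrinks the matrices. A hedged sketch, again in NumPy with illustrative names and shapes: it drops whole hidden neurons of a two-layer MLP, scoring each neuron by the L2 norm of its incoming weights (a simple stand-in for the gradient-based importance scores mentioned above).

```python
import numpy as np

def prune_neurons(w_in: np.ndarray, w_out: np.ndarray, keep_ratio: float):
    """Structured pruning of one hidden layer: drop whole neurons
    (rows of w_in / columns of w_out), returning smaller dense matrices."""
    # Importance score per hidden neuron: L2 norm of its incoming weights.
    scores = np.linalg.norm(w_in, axis=1)
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # strongest neurons, in order
    return w_in[keep, :], w_out[:, keep]

rng = np.random.default_rng(0)
w1 = rng.normal(size=(256, 64))   # hidden x input
w2 = rng.normal(size=(10, 256))   # output x hidden
w1_p, w2_p = prune_neurons(w1, w2, keep_ratio=0.6)
print(w1_p.shape, w2_p.shape)     # smaller dense matrices, no sparse kernels needed
```

Because the pruned matrices are smaller and still dense, the speedup is realized on ordinary GPUs and CPUs.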

Pruning + Quantization + Distillation

The three compression techniques compose well: prune redundant parameters, quantize the remaining weights to lower precision, and optionally distill from the original model to recover any quality loss. This pipeline can reduce a model to 10–20% of its original size while retaining 95%+ of its capability. The order matters: typically prune first, then quantize the pruned model, then fine-tune to recover quality.
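The prune-then-quantize ordering can be shown end to end. A minimal sketch under stated assumptions: symmetric per-tensor int8 quantization and 50% magnitude pruning are illustrative choices, and the distillation/fine-tune recovery step is omitted.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero the smallest weights."""
    k = int(w.size * sparsity)
    thr = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return w * (np.abs(w) > thr)

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: int8 values plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

pruned = magnitude_prune(w, sparsity=0.5)   # step 1: prune
q, scale = quantize_int8(pruned)            # step 2: quantize the pruned model
w_hat = q.astype(np.float32) * scale        # dequantize for the recovery step
err = np.abs(w_hat - pruned).max()
print(f"max quantization error: {err:.4f}")
```

Pruning first means the zeroed weights quantize exactly to zero, and the fine-tune (or distillation) pass then only has to recover the loss introduced by the surviving low-precision weights.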
