
Pruning

Model Pruning, Weight Pruning
Removing unnecessary parameters (weights, neurons, or entire layers) from a trained model to make it smaller and faster without significant quality loss. Like pruning a tree: cut the branches that contribute least and the tree stays healthy. Structured pruning removes entire neurons or attention heads. Unstructured pruning zeros out individual weights.

Why it matters

Pruning is a model compression technique alongside quantization and distillation. The key insight: most neural networks are overparameterized — many weights contribute little to the output. The "lottery ticket hypothesis" suggests that within a large network, there exists a much smaller subnetwork that can match the original's performance. Pruning finds and keeps that subnetwork.

Deep Dive

Unstructured pruning sets individual weights to zero based on magnitude (smallest weights contribute least). This creates sparse weight matrices. The challenge: standard hardware doesn't efficiently handle sparse computations, so a model that's 50% pruned doesn't run 2x faster on a GPU — the speedup requires specialized sparse computation libraries or hardware. This limits unstructured pruning's practical benefit.
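A minimal NumPy sketch of magnitude-based unstructured pruning (the function name and sparsity choice are illustrative, not from any particular library): the smallest-magnitude fraction of weights is zeroed, leaving a sparse matrix of the same shape.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero out the smallest-magnitude
    fraction of weights, keeping the matrix shape unchanged."""
    k = int(sparsity * weights.size)        # number of weights to zero
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold      # keep only larger weights
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned = magnitude_prune(W, sparsity=0.5)  # half the entries are now 0
```

Note that `W_pruned` is still a dense 4x4 array full of zeros; realizing an actual speedup from it would require a sparse storage format and kernels that exploit it, which is exactly the hardware limitation described above.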

Structured Pruning

Structured pruning removes entire neurons, attention heads, or layers. This produces a smaller dense model that runs faster on standard hardware without needing sparse computation support. Research shows that many attention heads are redundant — removing 20–40% of heads in a Transformer often has minimal impact on performance. Some heads consistently contribute more than others, and the important heads can be identified through gradient-based importance scores.
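The same idea can be sketched for structured pruning of a toy two-layer MLP (x @ W1 + b1, nonlinearity, then @ W2). Here whole hidden neurons are dropped, ranked by a simple weight-norm importance score standing in for the gradient-based scores mentioned above; all names and the scoring rule are illustrative assumptions.

```python
import numpy as np

def prune_neurons(W1, b1, W2, keep_ratio):
    """Structured pruning: remove whole hidden neurons, producing
    *smaller dense* matrices that run faster on standard hardware."""
    # Importance proxy: L2 norm of each neuron's incoming and
    # outgoing weights (a stand-in for gradient-based scores).
    importance = np.linalg.norm(W1, axis=0) + np.linalg.norm(W2, axis=1)
    n_keep = max(1, int(keep_ratio * W1.shape[1]))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # most important neurons
    return W1[:, keep], b1[keep], W2[keep, :]

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 16))   # input dim 8, hidden width 16
b1 = rng.normal(size=16)
W2 = rng.normal(size=(16, 4))   # output dim 4
W1p, b1p, W2p = prune_neurons(W1, b1, W2, keep_ratio=0.5)
# Hidden width shrinks 16 -> 8; the result is an ordinary dense model.
```

Pruning attention heads works analogously: an entire head's query/key/value/output slices are deleted at once, so the remaining computation stays dense.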

Pruning + Quantization + Distillation

The three compression techniques compose well: prune redundant parameters, quantize the remaining weights to lower precision, and optionally distill from the original model to recover any quality loss. This pipeline can reduce a model to 10–20% of its original size while retaining 95%+ of its capability. The order matters: typically prune first, then quantize the pruned model, then fine-tune to recover quality.
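The prune-then-quantize ordering can be sketched end to end in NumPy (the fine-tune/distill recovery step is noted but omitted; the symmetric int8 scheme and all names here are illustrative assumptions, not a production recipe):

```python
import numpy as np

def prune_then_quantize(W, sparsity=0.5, bits=8):
    """Compression pipeline sketch: (1) magnitude-prune,
    (2) symmetric integer quantization of surviving weights.
    Step (3), fine-tuning to recover quality, is done in training code."""
    # 1) Prune: zero the smallest-magnitude weights.
    k = int(sparsity * W.size)
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1] if k else -np.inf
    W_pruned = np.where(np.abs(W) > thresh, W, 0.0)
    # 2) Quantize: map remaining floats to signed integers.
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    scale = max(np.abs(W_pruned).max() / qmax, 1e-12)
    W_q = np.round(W_pruned / scale).astype(np.int8)
    return W_q, scale

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 4))
W_q, scale = prune_then_quantize(W)
W_dequant = W_q.astype(np.float32) * scale   # approximate weights at inference
```

Pruning first means the quantizer's scale is fit only to the weights that survive, and pruned zeros quantize exactly to integer 0, so the two steps do not interfere.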
