Unstructured pruning sets individual weights to zero based on magnitude, on the assumption that the smallest-magnitude weights contribute least to the output. This creates sparse weight matrices. The challenge: standard hardware doesn't execute sparse computations efficiently, so a model that's 50% pruned doesn't run 2x faster on a GPU; realizing the speedup requires specialized sparse-computation libraries or hardware. This limits unstructured pruning's practical benefit.
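A minimal sketch of magnitude-based unstructured pruning in NumPy (the `magnitude_prune` helper and the 50% sparsity target are illustrative, not from any particular library):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest weight (sorted flat view).
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

W = np.random.randn(64, 64)
W_pruned = magnitude_prune(W, sparsity=0.5)
print(np.mean(W_pruned == 0))  # ~0.5: half the entries are now zero
```

The matrix shape is unchanged; only the zero pattern makes it sparse, which is exactly why dense hardware sees no speedup.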
Structured pruning removes entire neurons, attention heads, or layers. This produces a smaller dense model that runs faster on standard hardware without needing sparse computation support. Research shows that many attention heads are redundant — removing 20–40% of heads in a Transformer often has minimal impact on performance. Some heads consistently contribute more than others, and the important heads can be identified through gradient-based importance scores.
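Head pruning can be sketched as slicing the kept heads' rows out of a projection matrix; the random `scores` below stand in for the gradient-based importance scores mentioned above, and the dimensions are illustrative:

```python
import numpy as np

d_model, n_heads = 512, 8
d_head = d_model // n_heads  # 64 dims per head

# Attention output projection with rows grouped by head.
W_o = np.random.randn(n_heads * d_head, d_model)

# Stand-in importance scores; in practice these would come from
# gradient-based sensitivity analysis, not random numbers.
scores = np.random.rand(n_heads)

# Keep the top 6 of 8 heads (25% of heads pruned).
keep = np.sort(np.argsort(scores)[-6:])

# Slice out the rows belonging to the kept heads: the result is a
# smaller *dense* matrix that runs on standard hardware.
rows = np.concatenate([np.arange(h * d_head, (h + 1) * d_head) for h in keep])
W_o_pruned = W_o[rows, :]
print(W_o_pruned.shape)  # (384, 512)
```

Unlike the unstructured case, the matrix genuinely shrinks, so every subsequent matmul does proportionally less work.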
The three compression techniques compose well: prune redundant parameters, quantize the remaining weights to lower precision, and optionally distill from the original model to recover any quality loss. This pipeline can reduce a model to 10–20% of its original size while retaining 95%+ of its capability. The order matters: typically prune first, then quantize the pruned model, then fine-tune to recover quality.
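A toy version of the prune-then-quantize ordering (symmetric per-tensor int8 quantization is one common scheme, used here for illustration; the fine-tuning/distillation step is only indicated):

```python
import numpy as np

W = np.random.randn(256, 256).astype(np.float32)

# Step 1: prune -- zero the 50% smallest-magnitude weights.
threshold = np.percentile(np.abs(W), 50)
W_pruned = np.where(np.abs(W) <= threshold, 0.0, W).astype(np.float32)

# Step 2: quantize the *pruned* weights to int8 (symmetric, per-tensor).
scale = np.abs(W_pruned).max() / 127.0
q = np.round(W_pruned / scale).astype(np.int8)

# Step 3 (fine-tuning / distillation) would train on the dequantized weights
# to recover quality; here we just dequantize.
W_restored = q.astype(np.float32) * scale

print(q.nbytes / W.nbytes)  # 0.25: int8 stores 4x fewer bytes than fp32
```

Pruning first means the quantizer's scale is fit to the surviving weights, and the rounding error per weight is bounded by `scale / 2`.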