
Pruning

Also called: Model Pruning, Weight Pruning
Removing unnecessary parameters (weights, neurons, or entire layers) from a trained model to make it smaller and faster without significant quality loss. Like pruning a tree: cut the branches that contribute least and the tree stays healthy. Structured pruning removes entire neurons or attention heads. Unstructured pruning zeroes out individual weights.
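
A minimal sketch of the difference, using NumPy with made-up weights (the matrix, threshold, and amounts below are illustrative only):

```python
import numpy as np

# Toy weight matrix of a dense layer: 4 inputs -> 3 neurons (made-up values).
W = np.array([[ 0.80, -0.02,  0.50],
              [-0.01,  0.90, -0.03],
              [ 0.40,  0.05, -0.70],
              [ 0.02, -0.60,  0.01]])

# Unstructured pruning: zero out individual weights with the smallest magnitude.
threshold = 0.1
W_unstructured = np.where(np.abs(W) < threshold, 0.0, W)

# Structured pruning: remove an entire neuron (the output column with the
# smallest total L1 norm), leaving a smaller but still dense matrix.
column_norms = np.abs(W).sum(axis=0)
keep = np.sort(column_norms.argsort()[1:])   # drop the weakest column
W_structured = W[:, keep]

print(W_unstructured)        # same shape, many zeros (sparse)
print(W_structured.shape)    # (4, 2): one neuron gone, still dense
```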

Why it matters

Pruning is a model compression technique alongside quantization and distillation. The key insight: most neural networks are overparameterized, meaning many weights contribute little to the output. The "lottery ticket hypothesis" suggests that inside a large network there exists a much smaller subnetwork that can match the original's performance. Pruning finds and keeps that subnetwork.

Deep Dive

Unstructured Pruning

Unstructured pruning sets individual weights to zero based on magnitude (the smallest weights contribute least). This creates sparse weight matrices. The challenge: standard hardware doesn't handle sparse computations efficiently, so a model that's 50% pruned doesn't run 2x faster on a GPU; the speedup requires specialized sparse computation libraries or hardware. This limits unstructured pruning's practical benefit.
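
PyTorch ships utilities for exactly this kind of magnitude-based masking; a minimal sketch (the layer size and pruning amount are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 50% of weights with the smallest absolute value (L1 magnitude).
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")   # bake the mask into the weight tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")
# The tensor is ~50% zeros but still stored densely, so a standard GPU matmul
# does not get faster unless sparse kernels or hardware support are used.
```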

Structured Pruning

Structured pruning removes entire neurons, attention heads, or layers. This produces a smaller dense model that runs faster on standard hardware without needing sparse computation support. Research shows that many attention heads are redundant — removing 20–40% of heads in a Transformer often has minimal impact on performance. Some heads consistently contribute more than others, and the important heads can be identified through gradient-based importance scores.
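
A rough sketch of head pruning under these assumptions: the per-head importance scores are taken as already computed (e.g. accumulated |gradient * activation| over a calibration set), and `q_proj`, the head count, and the dimensions are hypothetical:

```python
import torch

# Hypothetical importance scores for a 12-head attention layer.
head_importance = torch.tensor(
    [0.91, 0.12, 0.78, 0.05, 0.64, 0.88, 0.09, 0.71, 0.15, 0.83, 0.11, 0.59])

n_heads, head_dim, d_model = 12, 64, 768
keep = torch.topk(head_importance, k=8).indices.sort().values   # keep the top 8 heads

# Slice the query projection so the pruned heads disappear entirely,
# leaving a smaller dense weight (repeat for key, value, and output projections).
q_proj = torch.randn(d_model, d_model)          # stand-in for the real weight
q_proj = q_proj.view(n_heads, head_dim, d_model)[keep].reshape(-1, d_model)

print(q_proj.shape)   # torch.Size([512, 768]): 8 heads * 64 dims, still dense
```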

Pruning + Quantization + Distillation

The three compression techniques compose well: prune redundant parameters, quantize the remaining weights to lower precision, and optionally distill from the original model to recover any quality loss. This pipeline can reduce a model to 10–20% of its original size while retaining 95%+ of its capability. The order matters: typically prune first, then quantize the pruned model, then fine-tune to recover quality.
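
A sketch of that pipeline using PyTorch's built-in pruning and post-training dynamic quantization (the model, pruning amount, and dtype are placeholders, and the fine-tune/distill step is only indicated):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in model; in practice this is the full trained network.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# 1. Prune: zero out the 40% lowest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")

# 2. Quantize: convert the remaining weights to int8 with dynamic quantization.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

# 3. Fine-tune or distill to recover quality (omitted: it needs training data
#    and, with post-training quantization as here, is usually run on the pruned
#    float model or done via quantization-aware training).

print(quantized)
```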
