Pruning: Definition & Meaning — AI Wiki

Removing unnecessary parameters (weights, neurons, or entire layers) from a trained model to make it smaller and faster without significant loss of quality. Like pruning a tree: cut the branches that contribute least and the tree stays healthy. Structured pruning removes entire neurons or attention heads. Unstructured pruning sets individual weights to zero.

Why It Matters

Pruning is a model compression technique alongside quantization and distillation. The key intuition: most neural networks are overparameterized, so many weights contribute little to the output. The "lottery ticket hypothesis" suggests that inside a large network there is a much smaller sub-network that can match the performance of the original. Pruning aims to find and keep this sub-network.

Deep Dive

Unstructured pruning sets individual weights to zero, typically by magnitude, on the assumption that the smallest weights contribute least. This creates sparse weight matrices. The challenge is that standard hardware doesn't handle sparse computation efficiently, so a model with 50% of its weights pruned doesn't run 2x faster on a GPU; any speedup requires specialized sparse computation libraries or hardware. This limits the practical benefit of unstructured pruning.
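A minimal sketch of magnitude-based unstructured pruning, using plain Python lists rather than any particular framework. The function name `magnitude_prune` and the `sparsity` parameter are illustrative choices, not a library API. Note that the output matrix has the same shape as the input, which is why dense hardware sees no automatic speedup.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights:  list of rows (a 2-D weight matrix)
    sparsity: fraction of weights to set to zero, between 0 and 1
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # how many weights to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]  # largest magnitude that gets pruned

    pruned, removed = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # The counter stops ties at the threshold from over-pruning.
            if abs(w) <= threshold and removed < k:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


w = [[0.8, -0.05, 0.3],
     [0.01, -0.6, 0.1]]
print(magnitude_prune(w, 0.5))
# [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
```

The three smallest-magnitude weights become zero, but the matrix is still 2x3: a dense matrix multiply does the same amount of work unless the runtime exploits the zeros.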

Structured Pruning

Structured pruning removes entire neurons, attention heads, or layers. This produces a smaller dense model that runs faster on standard hardware without needing sparse computation support. Research shows that many attention heads are redundant — removing 20–40% of heads in a Transformer often has minimal impact on performance. Some heads consistently contribute more than others, and the important heads can be identified through gradient-based importance scores.
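A minimal sketch of structured pruning on a two-layer block, again in plain Python. As a simplifying assumption, it scores each hidden neuron by the L2 norm of its incoming weights instead of the gradient-based importance scores mentioned above; the function name `prune_neurons` and its arguments are hypothetical. The point it illustrates is that the result is a smaller dense model: rows are dropped from one matrix and the matching columns from the next.

```python
import math


def prune_neurons(w_in, w_out, keep):
    """Keep the `keep` highest-scoring hidden neurons.

    w_in:  hidden x inputs matrix (one row per hidden neuron)
    w_out: outputs x hidden matrix (one column per hidden neuron)
    """
    # Assumption: importance = L2 norm of a neuron's incoming weights.
    scores = [math.sqrt(sum(w * w for w in row)) for row in w_in]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order

    new_in = [w_in[i] for i in kept]                     # drop rows
    new_out = [[row[i] for i in kept] for row in w_out]  # drop columns
    return new_in, new_out, kept


w_in = [[0.9, -0.7], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
w_out = [[1.0, 2.0, 3.0, 4.0]]
new_in, new_out, kept = prune_neurons(w_in, w_out, keep=2)
print(kept)     # [0, 2]
print(new_in)   # [[0.9, -0.7], [0.5, 0.4]]
print(new_out)  # [[1.0, 3.0]]
```

The hidden layer shrinks from 4 neurons to 2, so both matrices are genuinely smaller and ordinary dense kernels run faster without sparse support.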

Pruning + Quantization + Distillation

The three compression techniques compose well: prune redundant parameters, quantize the remaining weights to lower precision, and optionally distill from the original model to recover any quality loss. This pipeline can reduce a model to 10–20% of its original size while retaining 95%+ of its capability. The order matters: typically prune first, then quantize the pruned model, then fine-tune to recover quality.
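A rough sketch of how the size reductions compound, assuming structured pruning (so removed parameters take no storage) and a simple symmetric int8 quantization scheme. The helper names `compressed_fraction`, `quantize_int8`, and `dequantize` are illustrative, and real toolchains use more elaborate schemes (per-channel scales, calibration).

```python
def compressed_fraction(keep_fraction, bits_before, bits_after):
    """Fraction of the original size left after pruning, then quantizing."""
    return keep_fraction * bits_after / bits_before


def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [v * scale for v in q]


# Keeping half the parameters and going from 32-bit to 8-bit storage:
print(compressed_fraction(0.5, 32, 8))  # 0.125

# Quantizing already-pruned weights: zeros stay exactly zero.
pruned = [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
q, scale = quantize_int8(pruned)
print(q)  # [127, 0, 48, 0, -95, 0]
```

Under these assumptions, pruning half the parameters and storing the rest in int8 leaves 12.5% of the original size, which falls inside the 10–20% range quoted above. The rounding error per weight is at most half of `scale`, which is the loss that a later fine-tuning or distillation step tries to recover.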
