Gradient Checkpointing: Definition & Meaning — AI Wiki

Une technique d'économie mémoire qui échange du compute contre de la mémoire pendant l'entraînement. Au lieu de stocker toutes les activations intermédiaires de la forward pass (nécessaires pour la backpropagation), le gradient checkpointing stocke seulement les activations à certaines couches « checkpoint » et recompute les autres pendant la backward pass. Ça réduit l'utilisation mémoire jusqu'à 5–10x au coût de ~30 % de compute en plus.

Pourquoi c'est important

Le gradient checkpointing est ce qui rend possible le fine-tuning de gros modèles sur de la mémoire GPU limitée. Sans lui, un modèle 7B pourrait avoir besoin de 80+ Go juste pour les activations pendant l'entraînement, dépassant la capacité d'un seul GPU. Avec le gradient checkpointing, le même modèle peut être fine-tuné sur un GPU grand public 24 Go. C'est l'optimisation mémoire la plus communément utilisée pour l'entraînement.

Deep Dive

During the forward pass, each layer's input activations are needed during the backward pass to compute gradients. Normally, all activations are stored in memory. With gradient checkpointing, only certain layers' activations are stored. During the backward pass, when an unstored activation is needed, the forward pass is re-run from the nearest checkpoint to recompute it. This trades ~30% extra compute (recomputing activations) for ~5x memory savings (not storing them all).

Checkpoint Placement

The optimal placement of checkpoints depends on the model architecture. The simplest approach: checkpoint every N layers (e.g., every 3rd Transformer block). More sophisticated: analyze the activation sizes per layer and place checkpoints to minimize total memory while limiting recomputation. Some frameworks (PyTorch's torch.utils.checkpoint) make this as simple as wrapping a layer call in a checkpoint function.

Combining with Other Techniques

Gradient checkpointing composes with other memory optimizations: mixed precision (FP16/BF16 halves activation size), gradient accumulation (smaller batches reduce peak memory), and FSDP/DeepSpeed (shard parameters across GPUs). Together, these can reduce a model's memory footprint by 10–50x compared to naive FP32 training, enabling training of models that are far larger than any single GPU's memory. This stack of optimizations is standard for fine-tuning 7B+ models.

Gradient Checkpointing

Pourquoi c'est important

Deep Dive

Checkpoint Placement

Combining with Other Techniques

Concepts liés