Training

Gradient Checkpointing

Also known as: Activation Checkpointing, Rematerialization
A memory-saving technique that trades compute for memory during training. Instead of storing all of the intermediate activations from the forward pass (which are needed for backpropagation), gradient checkpointing stores activations only at certain "checkpoint" layers and recomputes the rest during the backward pass. This cuts activation memory by up to 5-10x at the cost of roughly 30% extra compute.

Why It Matters

Gradient checkpointing is what makes it feasible to fine-tune large models on limited GPU memory. Without it, a 7B model can need 80+ GB just for activations during training, exceeding the capacity of a single GPU. With gradient checkpointing, the same model can be fine-tuned on a 24 GB consumer GPU. It is the most commonly used memory optimization for training.

Deep Dive

Each layer's input activations from the forward pass are needed during the backward pass to compute gradients. Normally, all of them are kept in memory. With gradient checkpointing, only certain layers' activations are stored; when the backward pass needs an activation that was not stored, the forward pass is re-run from the nearest checkpoint to recompute it. This trades roughly 30% extra compute (recomputing activations) for roughly 5x memory savings (not storing them all).
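
A minimal PyTorch sketch of the mechanism (the Block module, dimensions, and depth are illustrative, not from the source): each block call is wrapped in torch.utils.checkpoint.checkpoint, so only the block's input is kept and its internal activations are rebuilt on demand during backward.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Block(nn.Module):
        # A toy feed-forward block standing in for a Transformer layer.
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

        def forward(self, x):
            return x + self.net(x)

    class CheckpointedStack(nn.Module):
        def __init__(self, dim, depth):
            super().__init__()
            self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

        def forward(self, x):
            for block in self.blocks:
                # Only the block's input is kept in memory; the activations
                # inside the block are recomputed from it during backward.
                x = checkpoint(block, x, use_reentrant=False)
            return x

    model = CheckpointedStack(dim=512, depth=12)
    x = torch.randn(8, 128, 512, requires_grad=True)
    loss = model(x).sum()
    loss.backward()  # re-runs each block's forward to rebuild its activations

With use_reentrant=False, PyTorch's non-reentrant checkpoint implementation replays each wrapped forward during loss.backward(), which is where the extra compute comes from.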

Checkpoint Placement

The optimal placement of checkpoints depends on the model architecture. The simplest approach: checkpoint every N layers (e.g., every 3rd Transformer block). More sophisticated: analyze the activation sizes per layer and place checkpoints to minimize total memory while limiting recomputation. Some frameworks (PyTorch's torch.utils.checkpoint) make this as simple as wrapping a layer call in a checkpoint function.
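
As a rough sketch of the "every N layers" strategy, torch.utils.checkpoint.checkpoint_sequential splits a sequential stack into segments and stores only each segment's input (the layer sizes and segment count below are illustrative):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint_sequential

    # 12 identical blocks standing in for Transformer layers (illustrative sizes).
    layers = nn.Sequential(*[
        nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
        for _ in range(12)
    ])

    x = torch.randn(8, 512, requires_grad=True)

    # Split the stack into 4 segments of 3 layers each: only the 4 segment
    # inputs are stored; everything inside a segment is recomputed in backward.
    out = checkpoint_sequential(layers, 4, x, use_reentrant=False)
    out.sum().backward()

A common rule of thumb is to use roughly sqrt(n) segments for n layers, so stored activation memory grows with sqrt(n) instead of n at the cost of about one extra forward pass.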

Combining with Other Techniques

Gradient checkpointing composes with other memory optimizations: mixed precision (FP16/BF16 halves activation size), gradient accumulation (smaller batches reduce peak memory), and FSDP/DeepSpeed (shard parameters across GPUs). Together, these can reduce a model's memory footprint by 10–50x compared to naive FP32 training, enabling training of models that are far larger than any single GPU's memory. This stack of optimizations is standard for fine-tuning 7B+ models.
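
A sketch of how checkpointing, BF16 mixed precision, and gradient accumulation can be combined in a plain PyTorch loop (the model sizes, batch sizes, and step counts are placeholders); parameter sharding with FSDP or DeepSpeed would layer on top through those libraries' own APIs, and frameworks such as Hugging Face Transformers expose equivalent switches as configuration flags rather than hand-written loops.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Toy checkpointed stack (sizes and depth are illustrative).
    model = nn.Sequential(*[
        nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
        for _ in range(6)
    ]).to(device)

    def forward(x):
        for block in model:                                # gradient checkpointing:
            x = checkpoint(block, x, use_reentrant=False)  # store block inputs only
        return x

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    accum_steps = 8                                # gradient accumulation

    for step in range(32):                         # synthetic training steps
        x = torch.randn(4, 256, device=device)     # small micro-batch lowers peak memory
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = forward(x).pow(2).mean()        # BF16 activations: half the FP32 size
        (loss / accum_steps).backward()            # accumulate grads across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)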
