Training

Gradient Checkpointing

Also known as: Activation Checkpointing, Rematerialization
A technique that trades compute for memory during training. Instead of storing every intermediate activation from the forward pass (all of which are needed for backpropagation), gradient checkpointing stores activations only at selected "checkpoint" layers and recomputes the rest during the backward pass. This can cut activation memory several-fold (5–10x is a commonly cited range for deep models) at the cost of roughly 30% more compute.

Why it matters

Gradient checkpointing is a large part of what makes it possible to fine-tune large models on limited GPU memory. Without it, a 7B model can need 80+ GB for activations alone at long sequence lengths or large batch sizes, exceeding a single GPU's capacity. With gradient checkpointing, activation memory shrinks enough that, combined with other savings such as parameter-efficient fine-tuning or quantization, the same model can be fine-tuned on a 24GB consumer GPU. It is one of the most commonly used memory optimizations for training.

Deep Dive

During backpropagation, computing each layer's gradients requires the activations that were fed into that layer during the forward pass. Normally, all of these activations are kept in memory until the backward pass consumes them. With gradient checkpointing, only the activations at selected layers are kept. When the backward pass needs an activation that was not stored, the forward computation is re-run from the nearest earlier checkpoint to recompute it. The extra cost is at most about one additional forward pass, or roughly 30% more compute overall, in exchange for roughly 5x less activation memory.
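The mechanism above can be sketched in plain Python on a toy chain of scalar layers. This is a minimal illustration, not how any framework implements it: the layer y = tanh(w * x) and the names backward_full and backward_checkpointed are hypothetical, chosen only to show that recomputing from checkpoints gives the same gradients while holding fewer activations at once.

```python
import math

def layer_forward(w, x):
    # Hypothetical toy layer: y = tanh(w * x)
    return math.tanh(w * x)

def layer_backward(w, x, grad_out):
    # Needs the layer's INPUT x, which is why activations must be kept or recomputed.
    y = math.tanh(w * x)
    local = 1.0 - y * y  # derivative of tanh at w * x
    return grad_out * local * w, grad_out * local * x  # (grad wrt x, grad wrt w)

def backward_full(weights, x0):
    # Baseline: store the input of every layer during the forward pass.
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad, wgrads = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, wgrads[i] = layer_backward(weights[i], inputs[i], grad)
    return wgrads, len(inputs)  # gradients, number of stored activations

def backward_checkpointed(weights, x0, every):
    # Forward: keep only the input of every `every`-th layer.
    n, saved, x = len(weights), {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            saved[i] = x
        x = layer_forward(w, x)
    grad, wgrads, peak = 1.0, [0.0] * n, len(saved)
    # Backward: walk the segments from last to first, recomputing each one.
    for start in reversed(range(0, n, every)):
        end = min(start + every, n)
        seg_inputs = [saved[start]]
        for i in range(start, end - 1):
            seg_inputs.append(layer_forward(weights[i], seg_inputs[-1]))
        # Checkpoints plus the recomputed inputs of the current segment.
        peak = max(peak, len(saved) + len(seg_inputs) - 1)
        for i in reversed(range(start, end)):
            grad, wgrads[i] = layer_backward(weights[i], seg_inputs[i - start], grad)
    return wgrads, peak  # gradients, peak number of stored activations
```

With 16 layers and every=4, the checkpointed version holds at most 7 activations at a time (4 checkpoints plus 3 recomputed) instead of 16, and both functions return the same weight gradients.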

Checkpoint Placement

The optimal placement of checkpoints depends on the model architecture. The simplest approach is to checkpoint every N layers (e.g., every 3rd Transformer block). A more sophisticated approach is to analyze the activation sizes per layer and place checkpoints to minimize total memory while limiting recomputation. Some frameworks make this as simple as wrapping a layer call in a checkpoint function, for example torch.utils.checkpoint.checkpoint in PyTorch.
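As a rough sketch of the "every N layers" approach in PyTorch, the example below wraps each group of blocks in torch.utils.checkpoint.checkpoint, so only the group's input is kept and everything inside it is recomputed during the backward pass. The Block and Stack classes, their sizes, and the segment parameter are hypothetical stand-ins for a real Transformer; use_reentrant=False assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Hypothetical stand-in for a Transformer block (residual MLP only)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)

class Stack(nn.Module):
    def __init__(self, dim=32, depth=6, segment=3):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.segment = segment  # number of blocks between checkpoints

    def forward(self, x, use_checkpoint=True):
        blocks = list(self.blocks)
        for start in range(0, len(blocks), self.segment):
            group = blocks[start:start + self.segment]

            def run_group(h, group=group):
                for block in group:
                    h = block(h)
                return h

            if use_checkpoint:
                # Keep only the group's input; recompute its internals in backward.
                x = checkpoint(run_group, x, use_reentrant=False)
            else:
                x = run_group(x)
        return x

model = Stack()
x = torch.randn(4, 32)
loss = model(x).sum()
loss.backward()  # each group's forward is re-run here to rebuild its activations
```

A larger segment stores fewer checkpoints but holds more recomputed activations at once during the backward pass, so the best value depends on depth and per-layer activation size.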

Combining with Other Techniques

Gradient checkpointing composes with other memory optimizations: mixed precision (FP16/BF16 halves activation size relative to FP32), gradient accumulation (smaller micro-batches reduce peak activation memory), and FSDP/DeepSpeed (which can shard parameters, gradients, and optimizer state across GPUs). Together, these can reduce per-GPU memory by 10–50x compared to naive FP32 training, enabling training of models that are far larger than any single GPU's memory. This stack of optimizations is standard for fine-tuning 7B+ models.
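A minimal single-process sketch of how three of these pieces fit into one training step: checkpointed blocks, BF16 autocast, and gradient accumulation over micro-batches. The toy model, the train_step helper, and all sizes are hypothetical, and the example runs on CPU for simplicity. Sharding with FSDP or DeepSpeed is left out because it needs a multi-process setup.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical toy model: four small blocks and a regression head.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(32, 32), nn.GELU()) for _ in range(4)
)
head = nn.Linear(32, 1)
params = list(blocks.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

def forward(x):
    for block in blocks:
        # Gradient checkpointing: keep each block's input, recompute its internals.
        x = checkpoint(block, x, use_reentrant=False)
    return head(x)

def train_step(batch_x, batch_y, micro_batches=4):
    optimizer.zero_grad()
    total = 0.0
    # Gradient accumulation: only one micro-batch of activations is live at a time.
    for x, y in zip(batch_x.chunk(micro_batches), batch_y.chunk(micro_batches)):
        # Mixed precision: run the forward pass in BF16 where supported.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            pred = forward(x)
        # Scale so the accumulated gradient matches the full-batch mean loss.
        loss = nn.functional.mse_loss(pred.float(), y) / micro_batches
        loss.backward()
        total += loss.item()
    optimizer.step()
    return total

loss_value = train_step(torch.randn(16, 32), torch.randn(16, 1))
```

On a GPU the same structure applies with device_type="cuda" and the model and data moved to the device.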
