Training

Gradient Checkpointing

Also known as: Activation Checkpointing, Rematerialization
A memory-saving technique that trades compute for memory during training. Instead of storing all of the forward pass's intermediate activations (which are needed for backpropagation), gradient checkpointing stores activations only at selected "checkpoint" layers and recomputes the rest during the backward pass. This can reduce memory usage by 5–10x, at the cost of roughly 30% extra compute.

Why It Matters

Gradient checkpointing is what makes it possible to fine-tune large models on limited GPU memory. Without it, a 7B model can require 80+ GB for activations alone during training, beyond the capacity of a single GPU. With gradient checkpointing, the same model can be fine-tuned on a 24 GB consumer GPU. It is the most commonly used memory optimization for training.

Deep Dive

During the forward pass, each layer produces activations that are needed later, in the backward pass, to compute gradients. Normally, all activations are stored in memory. With gradient checkpointing, only certain layers' activations are stored. During the backward pass, when an unstored activation is needed, the forward pass is re-run from the nearest checkpoint to recompute it. This trades ~30% extra compute (recomputing activations) for ~5x memory savings (not storing them all).
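The store-some, recompute-the-rest mechanism can be illustrated with a toy example. The sketch below (hypothetical helper names, scalar functions instead of real layers) runs a chain of functions forward while keeping activations only every `stride` layers, then applies the chain rule backward, recomputing each missing activation from the nearest earlier checkpoint:

```python
# Toy illustration of gradient checkpointing on a chain of scalar
# functions y = f_{n-1}(...f_0(x)). Only every `stride`-th layer's
# input is stored; the rest are recomputed during the backward pass.

def forward_with_checkpoints(funcs, x, stride):
    """Forward pass that stores activations only at checkpoint layers."""
    checkpoints = {}
    a = x
    for i, f in enumerate(funcs):
        if i % stride == 0:
            checkpoints[i] = a          # store the input to layer i
        a = f(a)
    return a, checkpoints

def backward_with_checkpoints(funcs, derivs, checkpoints, stride):
    """Chain-rule backward pass; derivs[i] maps layer i's input to f_i'.

    Missing activations are recomputed from the nearest checkpoint:
    extra compute instead of extra memory.
    """
    grad = 1.0
    for i in reversed(range(len(funcs))):
        start = (i // stride) * stride  # nearest checkpoint at or before i
        a = checkpoints[start]
        for j in range(start, i):
            a = funcs[j](a)             # recompute layer i's input
        grad *= derivs[i](a)
    return grad

# Example: y = ((x^2)^2)^2 = x^8, so dy/dx = 8x^7.
funcs = [lambda v: v * v] * 3
derivs = [lambda v: 2 * v] * 3
y, ckpts = forward_with_checkpoints(funcs, 2.0, stride=2)
dydx = backward_with_checkpoints(funcs, derivs, ckpts, stride=2)
print(y, dydx)   # 256.0 1024.0
```

Note that only 2 of the 3 activations were ever stored; the backward pass rebuilt the missing one by re-running part of the forward chain.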

Checkpoint Placement

The optimal placement of checkpoints depends on the model architecture. The simplest approach: checkpoint every N layers (e.g., every 3rd Transformer block). More sophisticated: analyze the activation sizes per layer and place checkpoints to minimize total memory while limiting recomputation. Some frameworks (PyTorch's torch.utils.checkpoint) make this as simple as wrapping a layer call in a checkpoint function.
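As a concrete sketch of the "every N layers" strategy, here is a minimal example using PyTorch's `torch.utils.checkpoint.checkpoint` (the block sizes and stride are arbitrary illustrative choices, not a recommendation):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A small stack of blocks standing in for Transformer layers.
blocks = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    for _ in range(6)
)

def forward(x):
    for i, block in enumerate(blocks):
        if i % 3 == 0:
            # Checkpointed call: this block's activations are discarded
            # after the forward pass and recomputed during backward.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)                # stored normally
    return x

x = torch.randn(8, 64, requires_grad=True)
forward(x).sum().backward()             # gradients flow as usual
```

The backward pass behaves identically to the unwrapped version; only the memory/compute trade-off changes.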

Combining with Other Techniques

Gradient checkpointing composes with other memory optimizations: mixed precision (FP16/BF16 halves activation size), gradient accumulation (smaller batches reduce peak memory), and FSDP/DeepSpeed (shard parameters across GPUs). Together, these can reduce a model's memory footprint by 10–50x compared to naive FP32 training, enabling training of models that are far larger than any single GPU's memory. This stack of optimizations is standard for fine-tuning 7B+ models.
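To see how the savings compose, here is back-of-the-envelope arithmetic for the activation portion of the footprint. The layer count and per-layer size are made-up round numbers chosen to echo the figures above, not measurements:

```python
# Illustrative activation-memory arithmetic (hypothetical numbers).
layers = 32
act_per_layer_gb = 2.5                  # assumed FP32 activation size per layer

fp32_all = layers * act_per_layer_gb    # store everything in FP32: 80 GB
bf16_all = fp32_all / 2                 # BF16 halves activation size: 40 GB
bf16_ckpt = bf16_all / 5                # ~5x from checkpointing: 8 GB

print(fp32_all, bf16_ckpt)              # 80.0 -> 8.0, a 10x reduction
```

Parameter sharding (FSDP/DeepSpeed) attacks the weight and optimizer-state portion of memory separately, which is where the larger end of the 10–50x range comes from.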

Related Concepts
