Training

Gradient Checkpointing

Activation Checkpointing, Rematerialization
A memory-saving technique that trades compute for memory during training. Instead of storing all of the forward pass's intermediate activations (which the backward pass needs), gradient checkpointing stores activations only at certain "checkpoint" layers and recomputes the rest during the backward pass. This cuts memory use by roughly 5–10x at the cost of about 30% more compute.

Why It Matters

Gradient checkpointing is what makes fine-tuning large models on limited GPU memory possible. Without it, training a 7B model can require 80+ GB for activations alone, exceeding a single GPU's capacity. With gradient checkpointing, the same model can be fine-tuned on a 24 GB consumer GPU. It is the most commonly used memory optimization in training.

Deep Dive

Each layer's input activations from the forward pass are needed during the backward pass to compute gradients. Normally, all activations are stored in memory. With gradient checkpointing, only certain layers' activations are stored; when an unstored activation is needed during the backward pass, the forward pass is re-run from the nearest checkpoint to recompute it. This trades roughly 30% extra compute (recomputing activations) for roughly 5x memory savings (not storing them all).
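
The mechanism can be illustrated with a minimal PyTorch-style sketch. The `CheckpointFunction` below is a simplified, hypothetical illustration (single tensor in and out, no RNG or autocast state handling); PyTorch's `torch.utils.checkpoint` handles those details properly.

```python
import torch

class CheckpointFunction(torch.autograd.Function):
    """Simplified sketch of the checkpointing mechanism for one segment."""

    @staticmethod
    def forward(ctx, run_fn, x):
        ctx.run_fn = run_fn
        ctx.save_for_backward(x)       # keep only the checkpoint's input
        with torch.no_grad():          # inner activations are NOT stored
            return run_fn(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        x = x.detach().requires_grad_(True)
        with torch.enable_grad():      # re-run the segment's forward pass
            out = ctx.run_fn(x)
        out.backward(grad_out)         # backprop through the recomputed graph
        return None, x.grad            # no gradient for run_fn itself

# Usage: the segment's intermediate activations are recomputed during
# backward instead of being held in memory for the whole step.
segment = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)
y = CheckpointFunction.apply(segment, x)
y.sum().backward()
```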

Checkpoint Placement

The optimal placement of checkpoints depends on the model architecture. The simplest approach is to checkpoint every N layers (e.g., every 3rd Transformer block). A more sophisticated approach analyzes the activation sizes per layer and places checkpoints to minimize total memory while limiting recomputation. Some frameworks make this as simple as wrapping a layer call in a checkpoint function (e.g., PyTorch's torch.utils.checkpoint), as in the sketch below.
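
A sketch of the every-N-layers pattern using `torch.utils.checkpoint` (the `CheckpointedStack` module and segment size are hypothetical, not from any particular library):

```python
from functools import partial

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Stores activations only at every Nth block boundary; the blocks in
    between are recomputed during the backward pass."""

    def __init__(self, blocks, segment_size=3):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.segment_size = segment_size

    def _run_segment(self, start, x):
        for block in self.blocks[start:start + self.segment_size]:
            x = block(x)
        return x

    def forward(self, x):
        for start in range(0, len(self.blocks), self.segment_size):
            if self.training:
                # Only the segment's input is kept; its blocks are re-executed
                # in the backward pass to rebuild their activations.
                x = checkpoint(partial(self._run_segment, start), x,
                               use_reentrant=False)
            else:
                x = self._run_segment(start, x)
        return x

# Usage with standard encoder layers standing in for Transformer blocks.
blocks = [torch.nn.TransformerEncoderLayer(512, 8, batch_first=True)
          for _ in range(12)]
model = CheckpointedStack(blocks, segment_size=3).train()
model(torch.randn(4, 128, 512)).mean().backward()
```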

Combining with Other Techniques

Gradient checkpointing composes with other memory optimizations: mixed precision (FP16/BF16 halves activation size), gradient accumulation (smaller micro-batches reduce peak memory), and FSDP/DeepSpeed (sharding parameters, gradients, and optimizer states across GPUs). Together, these can cut a model's memory footprint by 10–50x compared to naive FP32 training, enabling training of models far larger than would otherwise fit in a single GPU's memory. This stack of optimizations is standard for fine-tuning 7B+ models.
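
A rough sketch of how some of these pieces compose in plain PyTorch (the model, data, and loss are toy stand-ins; a CUDA device is assumed, and FSDP/DeepSpeed sharding is omitted):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Toy stand-in for a real model.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)]).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # gradient accumulation: 4 micro-batches per optimizer step

micro_batches = torch.randn(16, 32, 1024).split(1)
for step, batch in enumerate(micro_batches):
    x = batch.squeeze(0).cuda()
    # Mixed precision (BF16) roughly halves activation memory.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        # checkpoint_sequential keeps activations only at segment boundaries
        # and recomputes the layers inside each segment during backward.
        out = checkpoint_sequential(model, 2, x, use_reentrant=False)
        loss = out.float().pow(2).mean() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

In higher-level frameworks these optimizations usually become flags rather than hand-written loops; for example, Hugging Face's Trainer exposes gradient_checkpointing, bf16, and gradient_accumulation_steps options.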
