Training

Checkpoint

Model Checkpoint, Snapshot
A snapshot of model state saved during training: weights, optimizer state, learning rate schedule, and the current training step. Checkpoints let you resume training after an interruption (hardware failure, preemption), evaluate intermediate versions of the model, and roll back to an earlier version if training degrades. Saving a checkpoint every few thousand steps is standard practice.

Why It Matters

Training a large model takes days to months. Without checkpoints, a GPU failure at step 90,000 of a 100,000-step run means starting over from scratch. Checkpoints are insurance: they save progress incrementally, so you only lose the work done since the last checkpoint. They also make model selection possible, since an earlier checkpoint sometimes performs better on your evaluation metrics than the final one.

Deep Dive

A full checkpoint for a 70B model includes: model weights (~140 GB in FP16), optimizer states (~280 GB for Adam, which stores two moving averages per parameter, here counted at FP16; FP32 optimizer states, common in mixed-precision training, would double this), learning rate scheduler state, random number generator states, and the current training step. Total: ~420 GB per checkpoint. Saving this to disk takes significant time and storage, which is why checkpointing is done periodically rather than every step.
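As a concrete illustration, here is a minimal PyTorch sketch of saving and restoring all of these pieces. The function names and the checkpoint dictionary layout are assumptions for this example, not a fixed API:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    # Bundle everything needed to resume training exactly where it left off.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),         # Adam's two moment buffers live here
        "scheduler": scheduler.state_dict(),          # position in the learning rate schedule
        "rng": torch.get_rng_state(),                 # CPU RNG state
        "cuda_rng": torch.cuda.get_rng_state_all(),   # per-device GPU RNG states
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    # Restore all saved state and return the step to resume from.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["rng"])
    torch.cuda.set_rng_state_all(ckpt["cuda_rng"])
    return ckpt["step"]
```

Restoring the RNG states in addition to the weights is what makes the resumed run bitwise-reproducible with respect to data shuffling and dropout, rather than merely "close".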

Checkpoint Strategies

Common strategies: save every N steps (simple but uses lots of storage), keep only the K most recent checkpoints (deleting older ones to save space), save based on evaluation metrics (keep the checkpoint with the best validation loss), and use async checkpointing (save in the background while training continues on the next batch). Large training runs often combine several of these: frequent local checkpoints on fast NVMe storage plus periodic remote checkpoints to network storage for disaster recovery.
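A minimal sketch of the keep-K-most-recent strategy, assuming checkpoints are single files named by step number in one directory (the directory layout and file naming are hypothetical):

```python
import os
import torch

def save_with_rotation(ckpt_dir, state, step, keep_last=3):
    # Save the new checkpoint, then prune all but the K most recent ones.
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(state, os.path.join(ckpt_dir, f"step_{step:08d}.pt"))
    # Zero-padded step numbers make lexicographic order match training order.
    checkpoints = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
    for old in checkpoints[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, old))
```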

Checkpoint Conversion

Different frameworks use different checkpoint formats: PyTorch's state_dict, Hugging Face's safetensors, FSDP's sharded checkpoints, and DeepSpeed's ZeRO checkpoints. Converting between formats is a common task — you might train with DeepSpeed (sharded across GPUs) but need a single consolidated checkpoint for inference or uploading to Hugging Face. The safetensors format is becoming the standard for sharing because it's fast to load and memory-safe.
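As one common conversion, here is a hedged sketch of re-saving a consolidated PyTorch state_dict as safetensors using the safetensors library's save_file. The filenames are placeholders, and the input is assumed to be a flat, weights-only state_dict (sharded DeepSpeed or FSDP checkpoints must be consolidated first):

```python
import torch
from safetensors.torch import save_file

# Load a consolidated PyTorch checkpoint containing only model weights.
state_dict = torch.load("model.pt", map_location="cpu")

# safetensors requires a flat {name: tensor} mapping with no shared storage,
# so make each tensor contiguous and standalone before saving.
state_dict = {name: t.contiguous().clone() for name, t in state_dict.items()}
save_file(state_dict, "model.safetensors")
```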
