Training

Checkpoint

Model Checkpoint, Snapshot
A saved snapshot of model state during training: the weights, optimizer state, learning rate schedule, and current training step. Checkpoints let you resume training after an interruption (hardware failure, preemption), evaluate intermediate versions of the model, and roll back to an earlier version if training degrades. Saving a checkpoint every few thousand steps is standard practice.

Why It Matters

Training a large model takes days to months. Without checkpoints, a GPU failure at step 90,000 of a 100,000-step run means starting over from scratch. Checkpoints are insurance: they save progress incrementally, so you only lose the work done since the last checkpoint. They also make model selection possible, since an earlier checkpoint sometimes performs better on your evaluation metrics than the final one.

Deep Dive

A full checkpoint for a 70B model includes: model weights (~140 GB in FP16), optimizer states (~280 GB for Adam, which stores two moving averages per parameter), learning rate scheduler state, random number generator states, and the current training step. Total: ~420 GB per checkpoint. Saving this to disk takes significant time and storage, which is why checkpointing is done periodically rather than every step.
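A minimal sketch of what saving and restoring that state looks like in PyTorch, for a single-process training loop. The function names and dictionary keys here are illustrative conventions, not a standard API:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    # Bundle everything needed to resume exactly where training stopped:
    # weights, optimizer moments, LR scheduler state, RNG states, and step.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "rng_cpu": torch.get_rng_state(),
        "rng_cuda": torch.cuda.get_rng_state_all(),
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["rng_cpu"])
    torch.cuda.set_rng_state_all(ckpt["rng_cuda"])
    return ckpt["step"]  # resume the training loop from this step
```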

Checkpoint Strategies

Common strategies: save every N steps (simple but uses lots of storage), save only the K most recent checkpoints (deleting older ones to save space), save based on evaluation metrics (keep the checkpoint with the best validation loss), and use async checkpointing (save in the background while training continues on the next batch). Large training runs often use all of these: frequent local checkpoints on fast NVMe storage plus periodic remote checkpoints to network storage for disaster recovery.
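As a sketch of the "keep only the K most recent" strategy, assuming checkpoints are written by a helper like the save_checkpoint above (all names here are illustrative):

```python
import os
from collections import deque

def make_rotating_saver(ckpt_dir, keep_last=3):
    # Returns a saver that keeps only the `keep_last` newest checkpoints on disk.
    recent = deque()

    def save(step, save_fn):
        path = os.path.join(ckpt_dir, f"step_{step:07d}.pt")
        save_fn(path)                     # write the new checkpoint
        recent.append(path)
        while len(recent) > keep_last:
            os.remove(recent.popleft())   # delete the oldest one
        return path

    return save

# Usage inside a training loop (hypothetical step interval):
#   saver = make_rotating_saver("checkpoints", keep_last=3)
#   if step % 2000 == 0:
#       saver(step, lambda p: save_checkpoint(p, model, optimizer, scheduler, step))
```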

Checkpoint Conversion

Different frameworks use different checkpoint formats: PyTorch's state_dict, Hugging Face's safetensors, FSDP's sharded checkpoints, and DeepSpeed's ZeRO checkpoints. Converting between formats is a common task — you might train with DeepSpeed (sharded across GPUs) but need a single consolidated checkpoint for inference or uploading to Hugging Face. The safetensors format is becoming the standard for sharing because it's fast to load and memory-safe.
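A sketch of one common conversion, turning a consolidated PyTorch weights file into safetensors. The file names are placeholders, and sharded DeepSpeed or FSDP checkpoints would first need to be consolidated with their framework's own tooling:

```python
import torch
from safetensors.torch import save_file, load_file

# Load a consolidated PyTorch checkpoint (weights only, no optimizer state).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# safetensors stores a flat map of tensors: keep only tensor values and
# make them contiguous before writing.
tensors = {k: v.contiguous() for k, v in state_dict.items()
           if isinstance(v, torch.Tensor)}
save_file(tensors, "model.safetensors")

# Loading back is a single call and does not execute arbitrary pickle code.
reloaded = load_file("model.safetensors")
```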

