Training

Checkpoint

Model Checkpoint, Snapshot
A saved snapshot of a model's state during training: the weights, optimizer state, learning rate schedule, and training step. Checkpoints let you resume training after interruptions (hardware failure, preemption), evaluate intermediate versions of the model, and roll back to an earlier version if training degrades. Saving a checkpoint every few thousand steps is standard practice.
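A minimal sketch of saving and resuming, using only the Python standard library so it runs without a training framework. The names `save_checkpoint`, `load_checkpoint`, and `train`, the single-weight "model", and the JSON file format are all assumptions made for illustration. In a real PyTorch run the same role is typically played by `torch.save` on the model and optimizer `state_dict()` objects.

```python
import json
import os
import random
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


def train(total_steps, ckpt_path, save_every, stop_at=None):
    """Toy training loop that resumes from ckpt_path if it exists.

    stop_at is a hypothetical knob that simulates an interruption.
    """
    if os.path.exists(ckpt_path):
        state = load_checkpoint(ckpt_path)
    else:
        state = {"step": 0, "weight": 0.0, "momentum": 0.0, "rng": None}

    rng = random.Random(0)
    if state["rng"] is not None:
        # JSON stores tuples as lists, so rebuild the tuple shape setstate expects.
        version, internal, gauss = state["rng"]
        rng.setstate((version, tuple(internal), gauss))

    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return None  # simulated crash: progress since the last save is lost
        grad = rng.uniform(-1, 1)                      # stand-in for a real gradient
        state["momentum"] = 0.9 * state["momentum"] + grad
        state["weight"] -= 0.1 * state["momentum"]
        state["step"] += 1
        if state["step"] % save_every == 0:
            state["rng"] = rng.getstate()              # needed for an exact resume
            save_checkpoint(ckpt_path, state)
    return state["weight"]
```

Because the sketch stores the optimizer state (`momentum`) and the random number generator state alongside the weight, a run that is interrupted and resumed ends at the same weight as an uninterrupted one. Dropping either field would still let training continue, but no longer reproduce the original trajectory.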

Why it matters

Training large models takes days to months. Without checkpoints, a GPU failure at step 90,000 of a 100,000-step run means starting over. Checkpoints are insurance: they save progress incrementally, so you lose only the work done since the last checkpoint. They also enable model selection: sometimes an earlier checkpoint scores better on your evaluation metrics than the final one.

Deep Dive

A full checkpoint for a 70B model includes: model weights (~140 GB in FP16), optimizer states (~280 GB for Adam, which stores two moving averages per parameter, assuming each is also kept at 2 bytes), learning rate scheduler state, random number generator states, and the current training step. Total: ~420 GB per checkpoint. If the optimizer states are kept in FP32, as is common in mixed-precision training, they alone take ~560 GB. Saving this much data takes significant time and storage, which is why checkpointing is done periodically rather than every step.
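The size estimate can be reproduced with a few lines of arithmetic. This is a rough sketch: the function name `checkpoint_size_gb` and its parameters are illustrative, it counts 1 GB as 10^9 bytes, and it ignores the small scheduler, RNG, and step-counter entries.

```python
def checkpoint_size_gb(n_params, weight_bytes=2, state_bytes=2, states_per_param=2):
    """Estimate checkpoint size in GB (1 GB = 1e9 bytes).

    weight_bytes:     bytes per weight (2 for FP16/BF16, 4 for FP32)
    state_bytes:      bytes per optimizer state value
    states_per_param: Adam keeps two moving averages per parameter
    """
    weights = n_params * weight_bytes / 1e9
    optimizer = n_params * state_bytes * states_per_param / 1e9
    return {"weights_gb": weights, "optimizer_gb": optimizer, "total_gb": weights + optimizer}


sizes = checkpoint_size_gb(70_000_000_000)
print(sizes)  # {'weights_gb': 140.0, 'optimizer_gb': 280.0, 'total_gb': 420.0}
```

Passing `state_bytes=4` models optimizer states held in FP32, which doubles the optimizer portion to 560 GB.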

Checkpoint Strategies

Common strategies: save every N steps (simple but uses lots of storage), save only the K most recent checkpoints (deleting older ones to save space), save based on evaluation metrics (keep the checkpoint with the best validation loss), and use async checkpointing (save in the background while training continues on the next batch). Large training runs often use all of these: frequent local checkpoints on fast NVMe storage plus periodic remote checkpoints to network storage for disaster recovery.

Checkpoint Conversion

Different frameworks use different checkpoint formats: PyTorch's state_dict, Hugging Face's safetensors, FSDP's sharded checkpoints, and DeepSpeed's ZeRO checkpoints. Converting between formats is a common task: you might train with DeepSpeed (sharded across GPUs) but need a single consolidated checkpoint for inference or for uploading to Hugging Face. The safetensors format is becoming the standard for sharing because it is fast to load and, unlike pickle-based formats, cannot execute arbitrary code when a file is loaded.
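The idea behind consolidating a sharded checkpoint can be shown with a toy example. This is not DeepSpeed's or FSDP's actual on-disk layout: it assumes each rank holds a contiguous slice of every flattened parameter, and the function name `consolidate_shards` and its arguments are hypothetical.

```python
def consolidate_shards(shards, shapes):
    """Merge per-rank shards into one full state dict.

    shards: list indexed by rank; each maps parameter name -> flat list slice
    shapes: parameter name -> (rows, cols) of the full parameter
    """
    full = {}
    for name, (rows, cols) in shapes.items():
        flat = []
        for rank_shard in shards:                 # concatenate slices in rank order
            flat.extend(rank_shard.get(name, []))
        if len(flat) != rows * cols:
            raise ValueError(f"{name}: expected {rows * cols} values, got {len(flat)}")
        # Reshape the flat list back into rows.
        full[name] = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    return full


shards = [{"w": [1, 2, 3, 4]}, {"w": [5, 6]}]     # two ranks, uneven split
print(consolidate_shards(shards, {"w": (2, 3)}))  # {'w': [[1, 2, 3], [4, 5, 6]]}
```

Real conversion tools do the same gather-and-reshape step on tensors, then write the result in the target format.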
