Apprendreing Rate Schedule: Definition & Meaning — AI Wiki

Une stratégie pour changer le learning rate pendant l'entraînement plutôt que de le garder constant. La plupart de l'entraînement moderne utilise warmup (augmentation graduelle de près de zéro au pic) suivi de decay (décroissance graduelle vers zéro). Le cosine annealing est le schedule de decay le plus commun. Le learning rate contrôle à quel point chaque étape d'update de gradient est grande — sans doute l'hyperparamètre le plus important en entraînement.

Pourquoi c'est important

Avoir le schedule de learning rate correct peut faire ou casser un run d'entraînement. Trop haut et le modèle diverge (la loss spike, l'entraînement échoue). Trop bas et il s'entraîne trop lentement ou se coince. Le schedule interagit avec le batch size, la taille du modèle et les données — il n'y a pas de réglage universel. Comprendre les schedules de learning rate t'aide à interpréter les courbes d'entraînement et diagnostiquer les issues d'entraînement.

Deep Dive

The standard LLM training schedule has three phases: (1) warmup: linearly increase the learning rate from ~0 to the peak value over the first 0.1–2% of training steps. This prevents the randomly initialized model from taking too-large steps early on. (2) Stable/peak: maintain the peak learning rate for the bulk of training. (3) Decay: decrease the learning rate following a cosine curve to near-zero by the end. This lets the model make fine-grained adjustments in the final phase.

Cosine Annealing

Cosine decay: lr(t) = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)), where t is the current step and T is the total steps. This produces a smooth curve that decreases slowly at first, then faster, then slowly again as it approaches the minimum. Why cosine? It works well empirically and avoids the abrupt transitions of step-based schedules. The final learning rate is typically 10x smaller than the peak.

The Apprendreing Rate-Batch Size Relationship

The linear scaling rule: if you double the batch size, double the learning rate. This preserves the effective step size when the gradient estimate becomes more accurate (from the larger batch). The rule holds approximately for moderate batch sizes but breaks down at very large batches, where the optimal learning rate grows slower than linearly. Getting this relationship right is critical for distributed training where batch size scales with the number of GPUs.

Apprendreing Rate Schedule

Pourquoi c'est important

Deep Dive

Cosine Annealing

The Apprendreing Rate-Batch Size Relationship

Concepts liés