Training

Learning Rate Schedule

LR Schedule, Warmup, Cosine Annealing
A strategy of varying the learning rate during training rather than keeping it constant. Most modern training uses warmup (gradually increasing from near zero to a peak value) followed by decay (gradually decreasing toward zero). Cosine annealing is the most common decay schedule. The learning rate controls how large each gradient update step is, and is arguably the most important hyperparameter in training.

Why It Matters

Getting the learning rate schedule right can make or break a training run. Too high and the model diverges (loss spikes, failed runs). Too low and training crawls or gets stuck. The schedule interacts with batch size, model size, and data, so there is no universal setting. Understanding learning rate schedules helps you read training curves and diagnose training problems.

Deep Dive

The standard LLM training schedule has three phases: (1) warmup: linearly increase the learning rate from ~0 to the peak value over the first 0.1–2% of training steps. This prevents the randomly initialized model from taking too-large steps early on. (2) Stable/peak: maintain the peak learning rate for the bulk of training. (3) Decay: decrease the learning rate following a cosine curve to near-zero by the end. This lets the model make fine-grained adjustments in the final phase.
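A minimal sketch of this three-phase schedule as a standalone Python function; the peak learning rate, minimum learning rate, warmup fraction, and stable fraction below are illustrative assumptions, not values taken from this page.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
               warmup_frac=0.01, stable_frac=0.4):
    """Three-phase schedule: linear warmup -> hold at peak -> cosine decay.

    All numeric defaults (peak_lr, min_lr, warmup_frac, stable_frac) are
    illustrative assumptions, not values from this article.
    """
    warmup_steps = int(warmup_frac * total_steps)
    stable_steps = int(stable_frac * total_steps)
    decay_steps = total_steps - warmup_steps - stable_steps

    if step < warmup_steps:
        # Phase 1: linear warmup from ~0 up to peak_lr.
        return peak_lr * (step + 1) / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        # Phase 2: hold the peak learning rate.
        return peak_lr
    # Phase 3: cosine decay from peak_lr down to min_lr over the remaining steps.
    t = step - warmup_steps - stable_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t / max(1, decay_steps)))
```

With total_steps=100_000 and these defaults, the learning rate ramps to 3e-4 over the first 1,000 steps, holds there until roughly step 41,000, then decays along a cosine to 3e-5 by the end.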

Cosine Annealing

Cosine decay: lr(t) = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)), where t is the current step and T is the total steps. This produces a smooth curve that decreases slowly at first, then faster, then slowly again as it approaches the minimum. Why cosine? It works well empirically and avoids the abrupt transitions of step-based schedules. The final learning rate is typically 10x smaller than the peak.
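In PyTorch, a warmup-then-cosine schedule of this shape can be assembled from the built-in schedulers. This is a sketch under assumed values (peak 3e-4, minimum 3e-5, 1,000 warmup steps out of 100,000) with a placeholder model:

```python
import torch

model = torch.nn.Linear(16, 16)                              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # lr here is the peak

warmup_steps, total_steps = 1_000, 100_000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=3e-5)  # eta_min ~ peak / 10
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# In the training loop, call optimizer.step() then scheduler.step() once per step;
# SequentialLR switches from warmup to cosine decay at the milestone.
```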

The Learning Rate-Batch Size Relationship

The linear scaling rule: if you double the batch size, double the learning rate. This preserves the effective step size when the gradient estimate becomes more accurate (from the larger batch). The rule holds approximately for moderate batch sizes but breaks down at very large batches, where the optimal learning rate grows slower than linearly. Getting this relationship right is critical for distributed training where batch size scales with the number of GPUs.
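A sketch of the rule in code; the reference configuration (batch size 256 at learning rate 3e-4) is an illustrative assumption, not a recommendation from this page.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: scale the learning rate in proportion to batch size."""
    return base_lr * (new_batch_size / base_batch_size)

# Example: a recipe tuned at batch size 256 with lr 3e-4, scaled to 8 GPUs
# each holding a per-device batch of 256 (global batch 2048):
print(scale_lr(3e-4, 256, 8 * 256))  # 0.0024, i.e. 8x the base lr; at very
                                     # large batches sub-linear scaling may work better
```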
