Training

Learning Rate Schedule

LR Schedule, Warmup, Cosine Annealing
A strategy of varying the learning rate during training rather than keeping it constant. Most modern training uses warmup (a gradual increase from near zero to a peak) followed by decay (a gradual decrease toward zero). Cosine annealing is the most common decay schedule. The learning rate controls how large each gradient update step is, making it arguably the most important hyperparameter in training.

Why it matters

Getting the learning rate schedule right can make or break a training run. Too high, and the model diverges (loss spikes, training fails). Too low, and it trains very slowly or gets stuck. The schedule interacts with batch size, model size, and data; there is no universal setting. Understanding learning rate schedules helps you interpret training curves and diagnose training issues.

Deep Dive

The standard LLM training schedule has three phases: (1) warmup: linearly increase the learning rate from ~0 to the peak value over the first 0.1–2% of training steps. This prevents the randomly initialized model from taking too-large steps early on. (2) Stable/peak: maintain the peak learning rate for the bulk of training. (3) Decay: decrease the learning rate following a cosine curve to near-zero by the end. This lets the model make fine-grained adjustments in the final phase.
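The three phases can be sketched as a single function of the step number. The peak value, warmup fraction, and decay fraction below are illustrative defaults, not values prescribed by the text:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4,
               warmup_frac=0.01, decay_frac=0.2, min_lr=3e-5):
    """Three-phase schedule: linear warmup, constant peak, cosine decay.

    All hyperparameter defaults here are illustrative assumptions.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Phase 1: linear warmup from ~0 to the peak
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Phase 2: hold the peak learning rate
        return peak_lr
    # Phase 3: cosine decay from the peak down to min_lr
    progress = (step - decay_start) / decay_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice such a function is passed to the optimizer via a scheduler callback (e.g. a lambda-based scheduler) and evaluated once per step.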

Cosine Annealing

Cosine decay: lr(t) = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)), where t is the current step and T is the total steps. This produces a smooth curve that decreases slowly at first, then faster, then slowly again as it approaches the minimum. Why cosine? It works well empirically and avoids the abrupt transitions of step-based schedules. The final learning rate is typically 10x smaller than the peak.
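The formula above can be checked numerically at the three landmark points: the start (t = 0, where lr = lr_max), the midpoint (halfway between max and min), and the end (t = T, where lr = lr_min). The lr_max and lr_min values below are arbitrary examples chosen so that the final rate is 10x smaller than the peak:

```python
import math

def cosine_lr(t, T, lr_max=1e-3, lr_min=1e-4):
    # lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

print(cosine_lr(0, 1000))     # start: equals lr_max
print(cosine_lr(500, 1000))   # midpoint: halfway between lr_max and lr_min
print(cosine_lr(1000, 1000))  # end: equals lr_min
```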

The Learning Rate-Batch Size Relationship

The linear scaling rule: if you double the batch size, double the learning rate. This preserves the effective step size when the gradient estimate becomes more accurate (from the larger batch). The rule holds approximately for moderate batch sizes but breaks down at very large batches, where the optimal learning rate grows slower than linearly. Getting this relationship right is critical for distributed training where batch size scales with the number of GPUs.
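The linear scaling rule reduces to a one-line helper. The base batch size and learning rate below are hypothetical starting values for illustration:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Linear scaling rule: learning rate scales proportionally with batch size.
    # Holds approximately for moderate batch sizes; breaks down for very large ones.
    return base_lr * (new_batch_size / base_batch_size)

# Example: scaling from batch 256 at lr 3e-4 to batch 1024 (e.g. 4x the GPUs)
print(scaled_lr(3e-4, 256, 1024))  # 4x the batch -> 4x the learning rate
```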

Related Concepts

← All Terms