Training

Gradient Descent

SGD, Stochastic Gradient Descent, Backpropagation
An algorithm for training neural networks by iteratively adjusting parameters to reduce a loss function. It works by computing the gradient of the loss with respect to each parameter (the direction of steepest increase), then moving each parameter a small step in the opposite direction (downhill). Backpropagation is the technique that efficiently computes these gradients through the layers of the network.
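In symbols, a single update step is θ ← θ − η ∇θ L(θ), where η is the learning rate and ∇θ L(θ) is the gradient of the loss with respect to the parameters θ.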

Why It Matters

Gradient descent is the engine under the hood of all deep learning. Every model you use, whether an LLM, an image generator, or an embedding model, was trained with gradient descent. Understanding it helps you understand training dynamics: why the learning rate matters, why training diverges or gets stuck, and why modern optimizers like Adam improve on naive gradient descent.

Deep Dive

The full algorithm: (1) take a batch of training examples, (2) run them through the model to get predictions, (3) compute the loss, (4) use backpropagation to compute the gradient of the loss with respect to every parameter, (5) update each parameter by subtracting the gradient times a learning rate, (6) repeat. In practice, "stochastic" gradient descent (SGD) uses small random mini-batches rather than the full dataset, which is both computationally necessary (computing a full-dataset gradient for every update would be far too expensive, and the data rarely fits in memory anyway) and beneficial (the noise from random batches helps the optimizer escape poor local minima and saddle points).
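As a concrete illustration of those six steps, here is a deliberately tiny mini-batch SGD loop in plain NumPy, fitting a toy linear model; the dataset, variable names, and hyperparameters are this sketch's own assumptions rather than anything from a particular framework.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                 # toy inputs
    true_w = rng.normal(size=5)
    y = X @ true_w + 0.1 * rng.normal(size=1000)   # toy targets

    w = np.zeros(5)           # parameters to learn
    lr, batch_size = 0.1, 32  # learning rate and mini-batch size

    for step in range(500):
        idx = rng.integers(0, len(X), batch_size)  # (1) sample a random mini-batch
        Xb, yb = X[idx], y[idx]
        pred = Xb @ w                              # (2) forward pass: predictions
        err = pred - yb
        loss = (err ** 2).mean()                   # (3) mean squared error loss
        grad = 2 * Xb.T @ err / batch_size         # (4) gradient of loss w.r.t. w
        w -= lr * grad                             # (5) step opposite the gradient
        if step % 100 == 0:                        # (6) repeat, watching the loss fall
            print(step, loss)

For a deep network, step (4) is where backpropagation comes in: instead of the closed-form gradient above, the framework applies the chain rule layer by layer to obtain the gradient for every parameter.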

Adam and Modern Optimizers

Plain SGD is rarely used for training large models today. Adam (Adaptive Moment Estimation) maintains a running average of both the gradient and its squared magnitude for each parameter, effectively giving each parameter its own adaptive step size. Dividing the update by the square root of the running squared-gradient average roughly normalizes step sizes: parameters whose gradients are consistently large take smaller effective steps, while parameters whose gradients are consistently small take larger ones. AdamW adds decoupled weight decay for regularization. Most LLM training uses AdamW or close variants.
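Below is a minimal sketch of the update Adam applies to one parameter tensor, assuming the usual hyperparameter names (beta1, beta2, eps) from the original paper; the function name and calling convention are illustrative only.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad            # running average of the gradient
        v = beta2 * v + (1 - beta2) * grad ** 2       # running average of its square
        m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
        return w, m, v                                # caller carries m, v between steps

AdamW differs in that it additionally subtracts lr * weight_decay * w each step, applying weight decay directly to the parameters rather than through the gradient.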

The Learning Rate

The learning rate is arguably the single most important hyperparameter in training. Too high and the model overshoots the minimum, loss diverges, and training fails. Too low and training takes forever or gets stuck. Modern training uses learning rate schedules: start with a warmup phase (gradually increasing from near-zero), reach a peak, then decay (cosine annealing is common). The peak learning rate, warmup duration, and decay schedule all interact with batch size and model architecture. Getting this right is a significant part of training large models.
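One common shape for such a schedule is linear warmup followed by cosine decay to a small floor; the sketch below shows the idea, with all step counts and learning-rate values being placeholder assumptions rather than recommendations.

    import math

    def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000,
                   total_steps=100_000, min_lr=3e-5):
        if step < warmup_steps:                       # linear warmup from ~0 to the peak
            return peak_lr * (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
        return min_lr + (peak_lr - min_lr) * cosine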
