Training

Adam Optimizer

Adam, AdamW
The most widely used optimization algorithm for training neural networks. Adam (Adaptive Moment Estimation) combines momentum (a moving average of past gradients) with adaptive learning rates (scaling each update inversely to the magnitude of past gradients). AdamW adds decoupled weight decay for better regularization. Nearly every modern LLM is trained with AdamW.

Why It Matters

Adam performs well across a wide range of tasks and hyperparameters, which is why it is the default optimizer. Understanding it explains why training "just works" most of the time (Adam adapts per parameter) and why it sometimes doesn't (Adam's optimizer state is twice the size of the model parameters, which matters for large models). It is also the answer to "which optimizer should I use?" in roughly 90% of cases.

Deep Dive

Adam maintains two moving averages per parameter: the first moment (mean of gradients — momentum) and the second moment (mean of squared gradients — adaptive scaling). The update rule: parameter -= lr × m̂ / (√v̂ + ε), where m̂ and v̂ are bias-corrected moments. Parameters with consistently large gradients get smaller updates (they're already well-calibrated). Parameters with small, noisy gradients get larger updates (they need more aggressive movement).
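A minimal NumPy sketch of that update (function name, defaults, and bookkeeping are illustrative, not any particular library's API):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients (momentum).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both moments start at zero, so early averages are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step: consistently large gradients -> smaller effective step.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage sketch: m and v start as np.zeros_like(param); t counts steps starting at 1.
```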

AdamW: The Fix

The original Adam applied weight decay by adding it to the gradient before computing moments, which caused the decay to be scaled by the adaptive learning rate — not what you want. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient update, applying it directly to the parameters. This seems like a minor fix but significantly improves generalization. All modern LLM training uses AdamW.
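A sketch of the decoupled version, continuing the NumPy example above (the weight_decay value is illustrative):

```python
def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Original Adam + L2 would fold decay into the gradient first
    # (grad = grad + weight_decay * param), so the decay gets divided
    # by sqrt(v_hat) like everything else. AdamW keeps it separate.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weights by a fixed fraction,
    # untouched by the per-parameter adaptive scaling.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```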

Memory Cost

Adam stores two additional values per parameter (first and second moments), so the optimizer state alone is at least twice the size of the weights: a 70B model needs ~140 GB for weights (FP16) plus ~560 GB for Adam states (two FP32 moments), roughly ~700 GB before gradients and master weights are counted. This is why optimizer state sharding (DeepSpeed ZeRO, FSDP) is essential for large model training. Some newer optimizers (Adafactor, CAME, Lion) reduce this memory overhead at some cost to stability.
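The arithmetic as a back-of-the-envelope calculation (parameter count and byte sizes only; gradients, activations, and the FP32 master copy used in mixed-precision training are ignored):

```python
params = 70e9                                     # 70B parameters
weights_fp16_gb = params * 2 / 1e9                # 2 bytes/param        -> ~140 GB
adam_state_fp32_gb = params * 2 * 4 / 1e9         # m and v, 4 bytes each -> ~560 GB
total_gb = weights_fp16_gb + adam_state_fp32_gb   # ~700 GB
print(f"weights {weights_fp16_gb:.0f} GB, Adam state {adam_state_fp32_gb:.0f} GB, "
      f"total {total_gb:.0f} GB")
```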
