Training

Adam Optimizer

Adam, AdamW
The most widely used optimization algorithm for training neural networks. Adam (Adaptive Moment Estimation) combines momentum (a moving average of past gradients) with adaptive learning rates (updates scaled inversely to the magnitude of past gradients). AdamW adds decoupled weight decay for better regularization. Nearly every modern LLM is trained with AdamW.

Why It Matters

Adam performs well across a wide range of tasks and hyperparameters, which makes it the default optimizer. Understanding it explains why training usually "just works" (Adam adapts per parameter) and why it sometimes does not (Adam's optimizer state needs twice as much memory as the model parameters, which matters for large models). It is also the answer to "which optimizer should I use?" in 90% of cases.

Deep Dive

Adam maintains two moving averages per parameter: the first moment (mean of gradients — momentum) and the second moment (mean of squared gradients — adaptive scaling). The update rule: parameter -= lr × m̂ / (√v̂ + ε), where m̂ and v̂ are bias-corrected moments. Parameters with consistently large gradients get smaller updates (they're already well-calibrated). Parameters with small, noisy gradients get larger updates (they need more aggressive movement).
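A minimal NumPy sketch of one Adam step, assuming param, grad, m, and v are arrays of the same shape and t is the step count starting at 1 (the function name and default hyperparameters are illustrative, not tied to any particular library):

    import numpy as np

    def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # First moment: exponential moving average of gradients (momentum).
        m = beta1 * m + (1 - beta1) * grad
        # Second moment: exponential moving average of squared gradients.
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction: moments start at zero, so early averages are scaled up.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-parameter scaling: a large second moment shrinks the step,
        # a small one enlarges it.
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
        return param, m, v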

AdamW: The Fix

The original Adam applied weight decay by adding it to the gradient before computing moments, which caused the decay to be scaled by the adaptive learning rate — not what you want. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient update, applying it directly to the parameters. This seems like a minor fix but significantly improves generalization. All modern LLM training uses AdamW.
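Continuing the same illustrative sketch, the only change in AdamW is where the decay enters: it is applied straight to the weights after the adaptive step, instead of being added to the gradient (and thus rescaled by 1/(√v̂ + ε)) as in Adam with L2 regularization. The weight_decay default below is an assumed value for illustration:

    import numpy as np

    def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01):
        # Moments are built from the raw gradient only; the decay term is NOT
        # mixed into grad here (that mixing is what original Adam's L2 did).
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Adaptive gradient step plus a decoupled shrink of the weights,
        # untouched by the 1/(sqrt(v_hat) + eps) scaling.
        param = (param - lr * m_hat / (np.sqrt(v_hat) + eps)
                 - lr * weight_decay * param)
        return param, m, v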

Memory Cost

Adam stores two additional values per parameter (the first and second moments), so optimizer state quickly dominates memory: a 70B model needs ~140 GB for weights (FP16) plus ~560 GB for Adam states (two moments in FP32), roughly 700 GB before gradients and activations. This is why optimizer state sharding (DeepSpeed ZeRO, FSDP) is essential for large model training. Some newer optimizers (Adafactor, CAME, Lion) reduce this memory overhead at some cost to stability.
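A quick back-of-the-envelope check of those figures (assuming 2 bytes per FP16 value, 4 bytes per FP32 value, and 1 GB = 10^9 bytes):

    # Rough memory for a 70B-parameter model trained with Adam.
    params = 70e9
    weights_fp16 = params * 2          # FP16 weights: ~140 GB
    adam_states_fp32 = params * 4 * 2  # first + second moments in FP32: ~560 GB
    total = weights_fp16 + adam_states_fp32
    print(f"weights {weights_fp16 / 1e9:.0f} GB, "
          f"optimizer {adam_states_fp32 / 1e9:.0f} GB, "
          f"total {total / 1e9:.0f} GB")
    # -> weights 140 GB, optimizer 560 GB, total 700 GB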

Related Concepts
