
Adam Optimizer

Adam, AdamW
The most widely used optimization algorithm for training neural networks. Adam (Adaptive Moment Estimation) combines momentum (a running average of past gradients) with adaptive learning rates (scaling updates by the inverse of past gradient magnitudes). AdamW adds decoupled weight decay for better regularization. Nearly every modern LLM is trained with AdamW.

Why It Matters

Adam works well across a wide range of tasks and hyperparameters, which is what makes it the default optimizer. Understanding it explains why training "just works" most of the time (Adam adapts per parameter) and why it sometimes doesn't (Adam's optimizer state is 2x the model's parameters, which matters for large models). It's also the answer to "which optimizer should I use?" in 90% of cases.

Deep Dive

Adam maintains two moving averages per parameter: the first moment (mean of gradients — momentum) and the second moment (mean of squared gradients — adaptive scaling). The update rule: parameter -= lr × m̂ / (√v̂ + ε), where m̂ and v̂ are bias-corrected moments. Parameters with consistently large gradients get smaller updates (they're already well-calibrated). Parameters with small, noisy gradients get larger updates (they need more aggressive movement).
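The update rule above can be sketched in a few lines of NumPy. This is an illustrative single-tensor version, not a production implementation; the function name and defaults are ours, though the defaults match the commonly cited Adam values:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad      # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction: moments start at zero,
    v_hat = v / (1 - beta2**t)              # so early estimates are biased low
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Note how the step size is roughly `lr × sign(gradient)` when gradients are consistent (m̂ and √v̂ have similar magnitude), which is what makes Adam relatively insensitive to gradient scale.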

AdamW: The Fix

The original Adam applied weight decay by adding it to the gradient before computing moments, which caused the decay to be scaled by the adaptive learning rate — not what you want. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient update, applying it directly to the parameters. This seems like a minor fix but significantly improves generalization. All modern LLM training uses AdamW.
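The decoupling is a one-line difference, sketched here under the same illustrative setup as above (our own function name and defaults). The key point is that the decay term never passes through the moments or the adaptive denominator:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: weight decay is NOT mixed into the gradient."""
    # Original Adam with L2 regularization would instead do:
    #   grad = grad + weight_decay * param
    # which lets the adaptive denominator rescale the decay per parameter.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    param = param - lr * weight_decay * param  # decoupled: applied directly
    return param, m, v
```

With the coupled version, parameters with large gradient history (large v̂) get almost no decay; with AdamW, every parameter decays at the same rate regardless of its gradient statistics.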

Memory Cost

Adam stores two additional values per parameter (first and second moments). In mixed-precision training these moments are typically kept in FP32, so optimizer state alone is 8 bytes per parameter: a 70B model needs ~140 GB for weights (FP16) plus ~560 GB for Adam states, totaling ~700 GB before gradients and activations. This is why optimizer state sharding (DeepSpeed ZeRO, FSDP) is essential for large model training. Some newer optimizers (Adafactor, CAME, Lion) reduce this memory overhead at some cost to stability.
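The back-of-envelope arithmetic, assuming FP16 weights and two FP32 moments per parameter (and ignoring gradients, activations, and any FP32 master weight copy):

```python
# Memory estimate for a 70B-parameter model trained with Adam.
params = 70e9
weights_gb = params * 2 / 1e9               # FP16: 2 bytes per parameter
adam_states_gb = params * 2 * 4 / 1e9       # two moments, FP32: 8 bytes per parameter
total_gb = weights_gb + adam_states_gb
print(weights_gb, adam_states_gb, total_gb)  # 140.0 560.0 700.0
```

Sharding these states across N GPUs (as ZeRO stage 1 does) divides the 560 GB term by N, which is why it is usually the first optimization applied.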
