Normalization: Definition & Meaning — AI Wiki

Normalization refers to techniques that stabilize neural network training by rescaling the values flowing through the network to a consistent scale. Layer Normalization (LayerNorm) normalizes across the features within each example. RMSNorm is a simplified variant of LayerNorm that skips mean centering. Batch Normalization (BatchNorm) normalizes each feature across the examples in a batch. Nearly every Transformer uses some form of normalization in its layers.

Why it matters

Without normalization, deep networks are much harder to train: activations can explode or vanish across layers, making gradient descent unstable. Normalization is one of those unglamorous techniques that is close to essential. Remove it from a typical modern architecture without compensating changes, and training usually becomes unstable or diverges.

Deep Dive

LayerNorm (Ba et al., 2016) computes the mean and variance over the feature dimension of a single example (in a Transformer, over the hidden vector of each token), normalizes those activations to zero mean and unit variance, and then applies learned per-feature scale and shift parameters. As a result, each layer receives inputs with a consistent scale regardless of the input magnitude, and the statistics do not depend on the batch size. It was the standard choice in the original Transformer and in models such as BERT and GPT-2.
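The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single feature vector, not a framework implementation: the names layer_norm, gamma, beta, and eps are chosen here for illustration, and the value of eps is an assumption.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one example's feature vector, then scale and shift it."""
    n = len(x)
    mean = sum(x) / n
    # Biased (divide-by-n) variance over the feature dimension.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)  # eps guards against division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

features = [2.0, 4.0, 6.0, 8.0]
gamma = [1.0] * len(features)  # learned scale, shown at a typical initial value
beta = [0.0] * len(features)   # learned shift, shown at a typical initial value

out = layer_norm(features, gamma, beta)
# Scaling the input by 100 leaves the normalized output essentially unchanged.
scaled_out = layer_norm([100.0 * v for v in features], gamma, beta)
```

With gamma at 1 and beta at 0, the output has approximately zero mean and unit variance, and the 100x larger input maps to nearly the same result, which is the "consistent scale regardless of input magnitude" property.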

RMSNorm: The Modern Default

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by dropping the mean centering and normalizing only by the root mean square: x / sqrt(mean(x²)), typically multiplied by a learned per-feature gain. In practice a small constant ε is added under the square root for numerical stability. This is computationally cheaper (no mean subtraction) and performs comparably in practice. LLaMA, Mistral, and many other recent LLMs use RMSNorm instead of LayerNorm.
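As a rough sketch of the formula above, again in plain Python for a single feature vector: the names rms_norm, gain, and eps are illustrative, and the eps value is an assumption rather than something taken from a particular model.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square of the features, then apply a gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps added for numerical stability
    return [g * v * inv_rms for v, g in zip(x, gain)]

features = [1.0, 2.0, 3.0, 4.0]
gain = [1.0] * len(features)  # learned gain, shown at a typical initial value

out = rms_norm(features, gain)
out_mean = sum(out) / len(out)                    # not forced to zero
out_mean_sq = sum(v * v for v in out) / len(out)  # approximately one
```

Unlike LayerNorm, the output keeps its nonzero mean: only the overall magnitude is fixed, which is the step RMSNorm keeps and the centering step it drops.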

Pre-Norm vs. Post-Norm

The original Transformer placed normalization after the residual addition that follows each attention or feed-forward block (post-norm): x = Norm(x + Sublayer(x)). Most modern architectures use pre-norm instead: normalize the input before passing it through the block, then add the result to the unnormalized residual, x = x + Sublayer(Norm(x)). Pre-norm is more stable during training, especially at large scale, and can often be trained with little or no learning rate warmup. This seemingly minor architectural choice has a significant impact on training stability.
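The difference between the two orderings is easiest to see side by side. The sketch below uses a parameter-free normalization and a toy sublayer function as a hypothetical stand-in for attention or a feed-forward block; none of these names come from a real library.

```python
import math

def normalize(x, eps=1e-5):
    """Parameter-free LayerNorm: zero mean, unit variance over features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward block."""
    return [0.5 * v + 1.0 for v in x]

def post_norm_block(x):
    # Original Transformer ordering: residual add first, then normalize.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return normalize(summed)

def pre_norm_block(x):
    # Pre-norm ordering: normalize only the sublayer input; the residual
    # path carries x forward unchanged.
    branch = sublayer(normalize(x))
    return [a + b for a, b in zip(x, branch)]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_norm_block(x)
pre_out = pre_norm_block(x)
```

In the post-norm block the whole output, residual included, is renormalized at every layer. In the pre-norm block the input x passes through the addition untouched, giving gradients a direct path through the residual connection, which is one common explanation for its better training stability.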
