
Normalization

LayerNorm, RMSNorm, BatchNorm
Techniques that stabilize neural network training by normalizing the values flowing through the network so they stay at a consistent scale. Layer Normalization (LayerNorm) normalizes across the features within each sample. RMSNorm is a simplified variant. Batch Normalization (BatchNorm) normalizes across the batch. Every Transformer uses some form of normalization between layers.

Why It Matters

Without normalization, deep networks are extremely hard to train: activations can explode or vanish from layer to layer, making gradient descent unstable. Normalization is one of those unglamorous but absolutely essential techniques: remove it from any modern architecture and training collapses.

Deep Dive

LayerNorm (Ba et al., 2016) computes the mean and variance of all activations within a single training example and normalizes them to zero mean and unit variance, then applies learned scale and shift parameters. This ensures that regardless of the input magnitude, each layer receives inputs with a consistent distribution. It's the standard in Transformers.
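The mean/variance computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation; the `gamma` and `beta` names for the learned scale and shift are conventional but chosen here for clarity.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each example across its feature dimension (last axis)
    # to zero mean and unit variance, then apply learned scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

d = 8
x = np.random.randn(2, d) * 10 + 3          # inputs at an arbitrary scale
y = layer_norm(x, np.ones(d), np.zeros(d))  # each row: ~zero mean, ~unit variance
```

Note that the statistics are per-example, so the result does not depend on batch size, which is one reason LayerNorm suits variable-length sequence models better than BatchNorm.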

RMSNorm: The Modern Default

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean centering and only normalizing by the root mean square: x / sqrt(mean(x²)). This is computationally cheaper (no need to compute mean for centering) and performs comparably. LLaMA, Mistral, and most modern LLMs use RMSNorm instead of LayerNorm.
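Dropping the mean centering leaves only the x / sqrt(mean(x²)) scaling, as in this minimal sketch (again with an illustrative `gamma` for the learned scale; RMSNorm keeps the scale parameter but has no shift):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Divide by the root mean square of the features; no mean subtraction.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

d = 8
x = np.random.randn(2, d) * 5
y = rms_norm(x, np.ones(d))
# Each row of y now has RMS ~1, but its mean is NOT forced to zero.
```

The savings come from skipping one reduction (the mean) and one subtraction per element, which matters at LLM scale where normalization runs on every layer of every token.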

Pre-Norm vs. Post-Norm

The original Transformer placed normalization after the attention/feed-forward block (post-norm). Modern architectures almost universally use pre-norm: normalize the input before passing it through the block, then add the residual. Pre-norm is more stable during training (especially at large scale) and allows training without learning rate warmup. This seemingly minor architectural choice has a significant impact on training stability.
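The two orderings differ only in where the normalization sits relative to the residual add. A toy sketch, with a parameter-free normalizer and a `tanh` standing in for the attention/feed-forward sublayer (both hypothetical simplifications):

```python
import numpy as np

def norm(x, eps=1e-5):
    # Parameter-free LayerNorm, for illustration only.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x):
    # Stand-in for an attention or feed-forward block.
    return np.tanh(x)

def post_norm(x):
    # Original Transformer: sublayer, residual add, THEN normalize.
    return norm(x + sublayer(x))

def pre_norm(x):
    # Modern architectures: normalize first, then sublayer, then residual add.
    return x + sublayer(norm(x))

x = np.random.randn(2, 8)
```

The intuition for pre-norm's stability: the residual path `x + ...` is never normalized away, so gradients can flow through an unmodified identity path from the loss back to early layers, whereas post-norm rescales the residual stream at every block.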
