
Normalization

LayerNorm, RMSNorm, BatchNorm
Techniques that stabilize neural network training by normalizing the values flowing through the network so they stay on a consistent scale. Layer Normalization (LayerNorm) normalizes across the features within each individual example. RMSNorm is a simplified variant of LayerNorm. Batch Normalization (BatchNorm) normalizes across the batch. Every Transformer uses some form of normalization between layers.

Why It Matters

Without normalization, deep networks are extremely hard to train: activations can explode or vanish from layer to layer, making gradient descent unstable. Normalization is one of those unglamorous but absolutely essential techniques — remove it from any modern architecture and training collapses.

Deep Dive

LayerNorm (Ba et al., 2016) computes the mean and variance of all activations within a single training example and normalizes them to zero mean and unit variance, then applies learned scale and shift parameters. This ensures that regardless of the input magnitude, each layer receives inputs with a consistent distribution. It's the standard in Transformers.
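The computation above can be sketched in a few lines of numpy. This is a minimal illustration, not a framework implementation; the parameter names `gamma` (scale) and `beta` (shift) follow common convention and the small `eps` guards against division by zero:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean and variance are taken over the feature dimension of each
    # example independently (axis=-1), not over the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned scale and shift let the layer undo normalization if useful.
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# Each row of `out` now has (approximately) zero mean and unit variance.
```

Because the statistics come from a single example, LayerNorm behaves identically at training and inference time, unlike BatchNorm, which needs running statistics.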

RMSNorm: The Modern Default

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean centering and only normalizing by the root mean square: x / sqrt(mean(x²)). This is computationally cheaper (no need to compute mean for centering) and performs comparably. LLaMA, Mistral, and most modern LLMs use RMSNorm instead of LayerNorm.
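A minimal numpy sketch of the RMSNorm formula above, for comparison with LayerNorm (again, `gamma` and `eps` are conventional names, not a specific library's API):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # No mean subtraction: divide by the root mean square of the features.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.array([[3.0, 4.0]])
out = rms_norm(x, gamma=np.ones(2))
# The output's mean squared value is ~1; its mean is NOT forced to zero.
```

The only differences from LayerNorm are the missing mean subtraction and the missing shift parameter `beta`, which is exactly what makes it cheaper.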

Pre-Norm vs. Post-Norm

The original Transformer placed normalization after the attention/feed-forward block (post-norm). Modern architectures almost universally use pre-norm: normalize the input before passing it through the block, then add the residual. Pre-norm is more stable during training (especially at large scale) and allows training without learning rate warmup. This seemingly minor architectural choice has a significant impact on training stability.
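The two orderings can be contrasted in a short sketch. The parameter-free `norm` below is a simplified stand-in for LayerNorm/RMSNorm, and `sublayer` stands for attention or the feed-forward block:

```python
import numpy as np

def norm(x, eps=1e-5):
    # Simplified LayerNorm (no learned scale/shift), for illustration only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer: sublayer, add residual, then normalize the sum.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern default: normalize the sublayer's input; the residual path
    # stays an identity, which keeps gradients well-scaled at depth.
    return x + sublayer(norm(x))
```

Note that in pre-norm, if the sublayer outputs zero, the block is exactly the identity function — this clean residual path is a common intuition for why pre-norm trains more stably in very deep stacks.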
