
Normalization

LayerNorm, RMSNorm, BatchNorm
Techniques that stabilize neural network training by normalizing the values flowing through the network to a consistent scale. Layer Normalization (LayerNorm) normalizes across the features within each example. RMSNorm is a simplified variant. Batch Normalization (BatchNorm) normalizes across the batch. Every Transformer uses some form of normalization between its layers.
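
As a minimal sketch of that difference in axes (PyTorch, with made-up tensor shapes), normalizing the same batch per example versus per feature:

```python
import torch

# Hypothetical batch: 4 examples, 8 features each.
x = torch.randn(4, 8)
eps = 1e-5  # small constant for numerical stability

# LayerNorm-style: statistics computed per example, across the feature axis.
ln = (x - x.mean(dim=-1, keepdim=True)) / torch.sqrt(
    x.var(dim=-1, unbiased=False, keepdim=True) + eps)

# BatchNorm-style: statistics computed per feature, across the batch axis.
bn = (x - x.mean(dim=0, keepdim=True)) / torch.sqrt(
    x.var(dim=0, unbiased=False, keepdim=True) + eps)

print(ln.mean(dim=-1))  # ~0 for each example (row)
print(bn.mean(dim=0))   # ~0 for each feature (column)
```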

Why it matters

Without normalization, deep networks are extremely difficult to train: activations can explode or vanish as they pass through the layers, making gradient descent unstable. Normalization is one of those unglamorous techniques that are absolutely essential: remove it from any modern architecture and training collapses.

Deep Dive

LayerNorm (Ba et al., 2016) computes the mean and variance of all activations within a single training example and normalizes them to zero mean and unit variance, then applies learned scale and shift parameters. This ensures that regardless of the input magnitude, each layer receives inputs with a consistent distribution. It's the standard in Transformers.
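
A minimal sketch of that computation (class and parameter names are illustrative; in practice this is what torch.nn.LayerNorm provides):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize across the last (feature) dimension of each example."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learned scale
        self.beta = nn.Parameter(torch.zeros(dim))  # learned shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)  # zero mean, unit variance
        return self.gamma * x_hat + self.beta
```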

RMSNorm: The Modern Default

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by dropping the mean centering and the shift parameter, normalizing only by the root mean square: x / sqrt(mean(x²)), followed by a learned gain. This is computationally cheaper (no mean to compute for centering) and performs comparably. LLaMA, Mistral, and most modern LLMs use RMSNorm instead of LayerNorm.
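
A minimal RMSNorm sketch under the same illustrative naming; the small epsilon inside the square root is the usual numerical-stability term:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalize by the root mean square only: no centering, no shift."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), then apply the learned gain.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gain * (x * inv_rms)
```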

Pre-Norm vs. Post-Norm

The original Transformer placed normalization after the attention/feed-forward block (post-norm). Modern architectures almost universally use pre-norm: normalize the input before passing it through the block, then add the residual. Pre-norm is more stable during training (especially at large scale) and allows training without learning rate warmup. This seemingly minor architectural choice has a significant impact on training stability.
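
A sketch of the two arrangements (sublayer and norm are stand-ins for an attention or feed-forward block and its normalization layer):

```python
import torch
import torch.nn as nn

def post_norm(x, sublayer, norm):
    # Original Transformer: apply the block, add the residual, then normalize.
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):
    # Modern default: normalize first, apply the block, then add the residual.
    return x + sublayer(norm(x))

# Illustrative usage with a feed-forward sublayer.
dim = 64
norm = nn.LayerNorm(dim)
ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
x = torch.randn(2, 10, dim)
out = pre_norm(x, ffn, norm)
```

In pre-norm, the residual path stays an unnormalized identity connection from input to output, which is one common explanation for its improved stability at scale.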
