Training

Weight Initialization

Xavier Init, Kaiming Init, He Init
How a neural network's weights are set before training begins. Bad initialization can doom training before it starts (vanishing or exploding activations). Good initialization keeps activations and gradients at reasonable magnitudes across layers. Xavier initialization (for tanh/sigmoid) and Kaiming/He initialization (for ReLU) are the standards, each calibrated to its activation function.

Why it matters

Initialization looks like a minor detail, but it is critical for training deep networks. A network whose random initial weights are too large produces exploding activations; one whose weights are too small produces vanishing activations. Proper initialization puts the network in a "Goldilocks zone" where signals flow without exploding or vanishing, a prerequisite for gradient descent to work at all.

Deep Dive

The core principle: initialize weights so that the variance of activations stays approximately constant from layer to layer. If each layer amplifies the signal (variance grows), activations explode; if each layer attenuates it (variance shrinks), activations vanish. Xavier initialization sets the weight variance to 2/(fan_in + fan_out). Kaiming initialization sets it to 2/fan_in, accounting for the fact that ReLU zeros out roughly half of the values.
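
A minimal sketch of both schemes, assuming simple fully connected layers (the function names and shapes are illustrative, not any library's API). The final loop checks that a Kaiming-initialized ReLU stack keeps activation magnitudes roughly constant:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out), calibrated for tanh/sigmoid
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def kaiming_init(fan_in, fan_out, rng):
    # Kaiming/He: Var(W) = 2 / fan_in, compensates for ReLU zeroing roughly half the units
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Sanity check: signal magnitude through a deep ReLU stack
rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 512))
for _ in range(50):
    x = np.maximum(x @ kaiming_init(512, 512, rng), 0.0)  # linear layer + ReLU
print(f"RMS activation after 50 layers: {np.sqrt((x ** 2).mean()):.3f}")  # stays O(1) rather than vanishing or exploding
```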

In Transformers

Modern Transformers often use a scaled initialization: output projection weights in attention and FFN layers are initialized with standard deviation scaled by 1/√(2×num_layers). This prevents the residual stream from growing too large as contributions from many layers accumulate. GPT-2 and many subsequent models use this "scaled init" approach. Some architectures (like muP/maximal update parameterization) take this further with mathematically derived scaling rules.
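
A hedged sketch of what such a scaled init can look like in PyTorch. The parameter-name suffixes (out_proj.weight, ffn_down.weight) and the base standard deviation of 0.02 are assumptions for illustration, not the exact convention of any particular codebase:

```python
import math
import torch.nn as nn

def init_transformer_weights(model: nn.Module, num_layers: int, base_std: float = 0.02):
    # GPT-2-style scaled init: every matrix starts from N(0, base_std), but the
    # projections that write into the residual stream are shrunk by 1/sqrt(2 * num_layers),
    # since each of the num_layers blocks adds two residual contributions (attention + FFN).
    for name, param in model.named_parameters():
        if param.dim() < 2:
            continue  # leave biases and LayerNorm gains at their defaults
        std = base_std
        if name.endswith("out_proj.weight") or name.endswith("ffn_down.weight"):
            std = base_std / math.sqrt(2 * num_layers)  # residual-writing projections (assumed names)
        nn.init.normal_(param, mean=0.0, std=std)
```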

Pre-Trained Weights

For most practical purposes, initialization from scratch is rare: you start from pre-trained weights and fine-tune. But initialization still matters for the new components: LoRA adapters, new classification heads, or extended vocabulary embeddings. Zero initialization for LoRA's B matrix (so the adapter's contribution starts at zero and the layer initially behaves exactly like the pre-trained one) and proper initialization for new token embeddings (typically the mean of the existing embeddings) are common patterns that prevent the new components from disrupting the pre-trained model at the start of fine-tuning.
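
A hedged sketch of both patterns in PyTorch; LoRALayer and init_new_token_embeddings are illustrative names, not a specific library's API:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-rank update B @ A added to a frozen base layer's output (scaling factor omitted)."""
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.empty(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: adapter contributes nothing at step 0
        nn.init.kaiming_uniform_(self.A, a=5 ** 0.5)             # A gets an ordinary random init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.T @ self.B.T  # exactly zero at initialization, learned during fine-tuning

def init_new_token_embeddings(embedding: nn.Embedding, num_new_tokens: int) -> None:
    """New vocabulary rows start at the mean of existing embeddings, so their scale is typical."""
    with torch.no_grad():
        old = embedding.weight[:-num_new_tokens]
        embedding.weight[-num_new_tokens:] = old.mean(dim=0, keepdim=True)
```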

Related concepts
