Training

Weight Initialization

Xavier Init, Kaiming Init, He Init
How a neural network's weights are set before training begins. Poor initialization can doom training before it even starts (activations vanish or explode). Good initialization keeps activations and gradients at reasonable magnitudes across layers. Xavier initialization (for tanh/sigmoid) and Kaiming/He initialization (for ReLU) are the standards, each calibrated to its activation function.

Why It Matters

Initialization looks like a minor detail, but it is critical for training deep networks. A network whose initial weights are too large produces exploding activations; weights that are too small produce vanishing activations. Proper initialization places the network in a "Goldilocks zone" where signal flows through without exploding or vanishing, a precondition for gradient descent to work at all.

Deep Dive

The core principle: initialize weights so that the variance of activations is approximately constant across layers. If each layer amplifies the signal (variance grows), activations explode. If each layer diminishes it (variance shrinks), activations vanish. Xavier initialization sets weights to variance 2/(fan_in + fan_out). Kaiming initialization sets variance 2/fan_in, accounting for the fact that ReLU zeros out half the values.
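A minimal PyTorch sketch of the two schemes; the layer sizes here are illustrative, not prescribed by either method.

```python
import torch.nn as nn

linear_tanh = nn.Linear(512, 512)  # layer feeding a tanh/sigmoid nonlinearity
linear_relu = nn.Linear(512, 512)  # layer feeding a ReLU nonlinearity

# Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out), keeps variance stable for tanh/sigmoid.
nn.init.xavier_normal_(linear_tanh.weight)

# Kaiming/He: Var(W) = 2 / fan_in, compensating for ReLU zeroing out half the activations.
nn.init.kaiming_normal_(linear_relu.weight, nonlinearity='relu')

# Biases are typically set to zero.
nn.init.zeros_(linear_tanh.bias)
nn.init.zeros_(linear_relu.bias)
```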

In Transformers

Modern Transformers often use a scaled initialization: output projection weights in attention and FFN layers are initialized with standard deviation scaled by 1/√(2×num_layers). This prevents the residual stream from growing too large as contributions from many layers accumulate. GPT-2 and many subsequent models use this "scaled init" approach. Some architectures (like muP/maximal update parameterization) take this further with mathematically derived scaling rules.
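A sketch of how such a scaled init might be applied in PyTorch. The helper name `init_transformer_weights`, the `is_residual_proj` flag, and `base_std=0.02` are assumptions for illustration, not any particular model's actual code.

```python
import math
import torch.nn as nn

def init_transformer_weights(module, num_layers, base_std=0.02):
    """Initialize most weights with base_std; scale residual-output projections
    (attention and FFN output layers) down by 1/sqrt(2 * num_layers)."""
    if isinstance(module, nn.Linear):
        std = base_std
        if getattr(module, "is_residual_proj", False):  # hypothetical marker attribute
            std = base_std / math.sqrt(2 * num_layers)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=base_std)

# Usage (hypothetical model): mark the output projections, then apply to all modules.
# model.apply(lambda m: init_transformer_weights(m, num_layers=12))
```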

Pre-Trained Weights

For most practical purposes, initialization from scratch is rare: you start from pre-trained weights and fine-tune. But initialization still matters for the new components: LoRA adapters, new classification heads, or extended vocabulary embeddings. Zero initialization for LoRA's B matrix (so the adapter's update starts at zero and the layer initially behaves exactly like the pre-trained one) and proper initialization for new token embeddings (typically copying the mean of the existing embeddings) are common patterns that prevent the new components from disrupting the pre-trained model at the start of fine-tuning. A sketch of both patterns follows.
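A minimal PyTorch sketch of both patterns; the rank, dimensions, and vocabulary sizes are illustrative, and the Kaiming-uniform choice for LoRA's A matrix follows common convention rather than a fixed rule.

```python
import math
import torch
import torch.nn as nn

# LoRA adapter init: A gets a small random init, B starts at zero, so the
# update B @ A is zero and the adapted layer initially matches the pre-trained layer.
r, d_in, d_out = 8, 4096, 4096  # illustrative rank and dimensions
lora_A = nn.Parameter(torch.empty(r, d_in))
lora_B = nn.Parameter(torch.zeros(d_out, r))
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))

# New vocabulary tokens: initialize the added embedding rows to the mean of the
# existing embeddings so they start out looking like a "typical" token.
old_vocab, new_tokens, d_model = 50257, 2, 768  # illustrative sizes
embedding = nn.Embedding(old_vocab + new_tokens, d_model)
with torch.no_grad():
    mean_vec = embedding.weight[:old_vocab].mean(dim=0)
    embedding.weight[old_vocab:] = mean_vec
```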
