Training

Dropout

Regularization, Weight Decay
A regularization technique that randomly "switches off" a subset of neurons at each training step by zeroing their outputs. This prevents the network from relying too heavily on any single neuron, forcing it to learn distributed, robust representations. At inference time all neurons are active, but their outputs are scaled to compensate.
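A minimal NumPy sketch of that mechanic (the toy vector, drop probability, and function names are illustrative; most frameworks instead use the equivalent "inverted" form that scales at training time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p=0.5):
    # Training: zero each activation independently with probability p.
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_inference(x, p=0.5):
    # Inference: keep every neuron but scale by the keep probability,
    # so expected activations match what downstream layers saw in training.
    return x * (1.0 - p)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(x))       # some entries randomly zeroed
print(dropout_inference(x))   # all entries kept, scaled by 0.5
```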

Why it matters

Dropout is the simplest and most widely used defense against overfitting. Without regularization, large neural networks memorize their training data instead of learning generalizable patterns. Dropout (and its sibling, weight decay) is why models can be far larger than their training sets without simply memorizing everything.

Deep Dive

The intuition: dropout trains an ensemble of sub-networks. Each training step uses a different random subset of neurons, effectively training a different architecture each time. At inference, using all neurons approximates averaging the predictions of all these sub-networks. This ensemble effect is what provides robustness — no single neuron can become a single point of failure.
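A short sketch of how this plays out in practice, assuming PyTorch: nn.Dropout samples a new sub-network on every training forward pass and becomes the identity in eval mode, which acts as the cheap stand-in for averaging the ensemble (layer sizes and drop rate below are illustrative):

```python
import torch
import torch.nn as nn

# Toy MLP with dropout between layers.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # a different random sub-network every forward pass
    nn.Linear(256, 10),
)

x = torch.randn(4, 128)

model.train()
y1, y2 = model(x), model(x)
print(torch.allclose(y1, y2))  # False: different neurons dropped each pass

model.eval()                   # dropout becomes the identity; all neurons active
y1, y2 = model(x), model(x)
print(torch.allclose(y1, y2))  # True: deterministic "ensemble average"
```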

Dropout in LLMs

Interestingly, many modern LLMs use little or no dropout during pre-training. At the scale of billions of parameters trained on trillions of tokens, overfitting is less of a concern because the model rarely, if ever, sees the same data twice. The training data is so vast relative to model capacity that the model is effectively always in the underfitting regime. Weight decay (L2 regularization) is more commonly used at this scale.
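A hedged sketch of that setup, assuming PyTorch: dropout inside the transformer block is set to zero and regularization comes from the weight-decay term in AdamW (the layer and hyperparameters are illustrative, not taken from any particular LLM):

```python
import torch
import torch.nn as nn

# Illustrative: dropout disabled in the block, regularization via weight decay.
model = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.0)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,          # illustrative learning rate
    weight_decay=0.1, # L2-style penalty that shrinks weights toward zero
)
```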

Variants

DropPath (stochastic depth) drops entire layers instead of individual neurons — used in Vision Transformers. DropConnect drops individual weights instead of neurons. Attention dropout drops attention weights to prevent the model from fixating on specific positions. Each variant addresses a different aspect of overfitting but shares the core idea: controlled randomness during training prevents over-specialization.
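As an illustration of the first variant, here is a minimal DropPath sketch, assuming PyTorch; the class name, drop probability, and masking details describe a typical implementation, not a reference one:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: skip an entire residual branch with probability p."""

    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        # One keep/drop decision per sample, broadcast over remaining dims.
        keep = 1.0 - self.p
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = torch.rand(shape, device=x.device, dtype=x.dtype) < keep
        return x * mask / keep  # inverted scaling keeps the expectation unchanged

# Typical use inside a residual block: x = x + drop_path(block(x)),
# so a "dropped" sample simply passes through the skip connection.
```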
