Training

Dropout

Regularization, Weight Decay
A regularization technique that randomly "switches off" a fraction of neurons at each training step by setting their outputs to zero. This prevents the network from relying too heavily on any single neuron and forces it to learn distributed, robust representations. At inference time all neurons are active, but their outputs are scaled proportionally to compensate.
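A minimal NumPy sketch of this behavior, assuming the classic formulation described above (zero out units during training, scale by the keep probability at inference); the function name and arguments are illustrative:

```python
import numpy as np

def dropout_forward(x, p_drop=0.1, training=True, rng=None):
    """Classic dropout: zero random units while training, scale at inference."""
    rng = rng or np.random.default_rng()
    if training:
        # Keep each unit independently with probability (1 - p_drop);
        # dropped units output exactly zero for this training step.
        mask = rng.random(x.shape) >= p_drop
        return x * mask
    # Inference: every unit is active, so scale outputs by the keep
    # probability to match the expected activation seen during training.
    return x * (1.0 - p_drop)
```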

Why It Matters

Dropout is the simplest and most widely used defense against overfitting. Without regularization, large neural networks memorize the training data instead of learning generalizable patterns. Dropout (and its sibling, weight decay) is why a model can be far larger than its training set without simply memorizing everything.

Deep Dive

The intuition: dropout trains an ensemble of sub-networks. Each training step uses a different random subset of neurons, effectively training a different architecture each time. At inference, using all neurons approximates averaging the predictions of all these sub-networks. This ensemble effect is what provides robustness — no single neuron can become a single point of failure.
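In practice, frameworks apply dropout only in training mode and turn it into the identity at inference. A short PyTorch illustration (layer sizes and the dropout rate are arbitrary); note that PyTorch's nn.Dropout uses the "inverted" formulation, scaling by 1/(1-p) during training so that no scaling is needed at inference:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # randomly zeros 10% of activations while training
    nn.Linear(128, 10),
)

x = torch.randn(4, 128)

model.train()            # dropout active: a different random mask each forward pass
y1, y2 = model(x), model(x)
print(torch.allclose(y1, y2))   # usually False

model.eval()             # dropout disabled: all neurons active, deterministic output
y3, y4 = model(x), model(x)
print(torch.allclose(y3, y4))   # True
```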

Dropout in LLMs

Interestingly, many modern LLMs use little or no dropout during pre-training. At the scale of billions of parameters trained on trillions of tokens, overfitting is less of a concern because the model rarely, if ever, sees the same data twice. The training data is so vast relative to model capacity that the model is effectively always in the underfitting regime. Weight decay (L2 regularization) is more commonly used at this scale.
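A hedged sketch of that setup (the specific values are illustrative, not taken from any particular model): dropout is simply set to zero in the layer configuration, and weight decay is applied through the optimizer.

```python
import torch
import torch.nn as nn

# A Transformer layer with dropout disabled, as is common in large-scale pre-training.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.0)

# Decoupled weight decay (AdamW) regularizes the weights directly instead of
# injecting noise into activations the way dropout does.
optimizer = torch.optim.AdamW(layer.parameters(), lr=3e-4, weight_decay=0.1)
```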

Variants

DropPath (stochastic depth) drops entire layers instead of individual neurons — used in Vision Transformers. DropConnect drops individual weights instead of neurons. Attention dropout drops attention weights to prevent the model from fixating on specific positions. Each variant addresses a different aspect of overfitting but shares the core idea: controlled randomness during training prevents over-specialization.
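As an example of one variant, here is a minimal sketch of DropPath / stochastic depth applied to a residual branch (a common formulation; the helper name and shapes are illustrative):

```python
import torch

def drop_path(residual, p_drop=0.1, training=True):
    """Stochastic depth: drop the entire residual branch, per sample.

    During training, each sample's residual branch is zeroed with probability
    p_drop and scaled by 1 / (1 - p_drop) when kept, so the expected output
    is unchanged. At inference the branch is always used.
    """
    if not training or p_drop == 0.0:
        return residual
    keep_prob = 1.0 - p_drop
    # One Bernoulli draw per sample; broadcast over the remaining dimensions.
    shape = (residual.shape[0],) + (1,) * (residual.ndim - 1)
    mask = torch.empty(shape, device=residual.device).bernoulli_(keep_prob)
    return residual * mask / keep_prob

# Usage inside a Transformer block (x: [batch, tokens, dim]):
# x = x + drop_path(attention(x), p_drop=0.1, training=self.training)
```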

Related Concepts

Regularization, Weight Decay
