Training

Dropout

Regularization, Weight Decay
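A regularization technique that randomly "turns off" a fraction of neurons during each training step by setting their outputs to zero. This prevents the network from relying too heavily on any single neuron, forcing it to learn distributed, robust representations. At inference time, all neurons are active, and activations are rescaled so that their expected magnitude matches what the network saw during training. With drop probability p, the original formulation multiplies outputs by the keep probability (1 - p) at inference; most modern frameworks instead scale the surviving activations by 1/(1 - p) during training ("inverted dropout") and leave inference unchanged.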
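The train-versus-inference behaviour can be sketched in a few lines. This is a minimal illustration of the inverted-dropout convention, not any particular library's implementation; the function name `dropout`, its parameters, and the example values are assumptions made for this sketch, with `p` denoting the probability of zeroing a unit.

```python
import random

def dropout(values, p, training, rng=random):
    """Inverted dropout over a list of activations (illustrative sketch).

    p is the probability of zeroing each activation, with 0 <= p < 1.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: every unit is active and no rescaling is needed,
        # because the scaling already happened during training.
        return list(values)
    keep = 1.0 - p
    # Zero each unit with probability p and scale survivors by 1/keep so
    # the expected value of each output equals the original activation.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Hypothetical activations, chosen only for illustration.
rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, p=0.5, training=True, rng=rng)  # some zeros, survivors doubled
eval_out = dropout(acts, p=0.5, training=False)           # identical to acts
```

Scaling during training rather than at inference keeps the inference path free of any dropout-specific logic, which is one reason this convention is the common default.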

Why it matters

Dropout is one of the simplest and most widely used defenses against overfitting. Without regularization, large neural networks can memorize training data instead of learning generalizable patterns. Dropout and its cousin weight decay are part of the reason models with far more parameters than training examples can still generalize instead of simply memorizing everything.

Deep Dive

The intuition: dropout trains an ensemble of sub-networks. Each training step uses a different random subset of neurons, effectively training a different "thinned" network each time, with all of these sub-networks sharing the same weights. At inference, using all neurons (with the appropriate scaling) approximates averaging the predictions of all these sub-networks. This ensemble effect is what provides robustness: no single neuron can become a single point of failure.
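For a single linear unit the ensemble view can be checked exactly by enumerating every sub-network. The sketch below assumes a keep probability of 0.5 (so every mask is equally likely) and inverted scaling; the weights, inputs, and the helper name `linear_unit` are hypothetical values picked for illustration.

```python
from itertools import product

def linear_unit(weights, inputs, mask, keep):
    # One sub-network: inputs with mask 0 are dropped, survivors are
    # scaled by 1/keep (inverted dropout).
    return sum(w * x * m / keep for w, x, m in zip(weights, inputs, mask))

weights = [0.5, -1.0, 2.0]
inputs = [1.0, 3.0, 0.5]
keep = 0.5  # with keep = 0.5 all masks have equal probability

# Enumerate all 2**3 = 8 sub-networks and average their outputs.
masks = list(product([0, 1], repeat=len(inputs)))
ensemble_mean = sum(linear_unit(weights, inputs, m, keep) for m in masks) / len(masks)

# The full network uses every input with no mask and no scaling.
full_output = sum(w * x for w, x in zip(weights, inputs))

print(ensemble_mean, full_output)
```

The two values agree here because the unit is linear. Once nonlinearities sit between layers, the full network only approximates the ensemble average, which is why the text above says "approximates".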

Dropout in LLMs

Many modern LLMs use little or no dropout during pre-training. At the scale of billions of parameters trained on trillions of tokens, overfitting is less of a concern because the model sees most of its data only once, or at most a few times. The training data is so vast relative to model capacity that the model is effectively always in the underfitting regime. Weight decay is the more common regularizer at this scale; it is closely related to L2 regularization, although the decoupled form used with adaptive optimizers such as AdamW is not strictly identical to adding an L2 penalty to the loss.

Variants

DropPath (stochastic depth) drops entire residual branches, that is, whole layers or blocks, instead of individual neurons, so a skipped block simply passes its input through unchanged; it is widely used in Vision Transformers. DropConnect drops individual weights instead of neurons. Attention dropout zeroes a random subset of attention weights after the softmax to prevent the model from fixating on specific positions. Each variant addresses a different aspect of overfitting but shares the core idea: controlled randomness during training prevents over-specialization.
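The three variants differ mainly in what the random mask is applied to. The sketch below is a simplified illustration under stated assumptions, not the procedure from the original papers or any library: all function names are hypothetical, everything operates on plain Python lists, and inverted 1/(1 - p) scaling is used throughout for uniformity (the original DropConnect and stochastic-depth formulations handle inference differently).

```python
import math
import random

def survives(p, rng):
    """Return True with probability 1 - p."""
    return rng.random() >= p

def drop_path(x, branch_out, p, training, rng=random):
    """Residual block x + branch(x); the whole branch is dropped with prob. p."""
    if not training or p == 0.0:
        return [a + b for a, b in zip(x, branch_out)]
    if survives(p, rng):
        scale = 1.0 / (1.0 - p)
        return [a + b * scale for a, b in zip(x, branch_out)]
    return list(x)  # branch skipped: the block acts as the identity

def drop_connect(weights, inputs, p, training, rng=random):
    """Linear unit where individual weights, not inputs, are zeroed."""
    if not training or p == 0.0:
        return sum(w * x for w, x in zip(weights, inputs))
    scale = 1.0 / (1.0 - p)
    return sum((w * scale if survives(p, rng) else 0.0) * x
               for w, x in zip(weights, inputs))

def attention_dropout(scores, p, training, rng=random):
    """Softmax over raw scores, then dropout on the resulting weights."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    if not training or p == 0.0:
        return attn
    scale = 1.0 / (1.0 - p)
    # After dropout the weights generally no longer sum to 1.
    return [a * scale if survives(p, rng) else 0.0 for a in attn]
```

In each case the `training=False` path is deterministic and uses the full computation, mirroring how plain dropout behaves at inference.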
