Training

Dropout

Regularization, Weight Decay
A regularization technique that randomly "switches off" a fraction of the neurons during each training step by setting their outputs to zero. This prevents the network from relying too heavily on any single neuron, forcing it to learn distributed, robust representations. At inference, all neurons are active but scaled accordingly.
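A minimal NumPy sketch of that behavior (the function name, drop rate, and interface are illustrative, not taken from any particular library):

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True, rng=None):
    """Classic dropout as described above: during training, zero out a random
    fraction p of the units; at inference, keep every unit but scale by (1 - p)
    so the next layer sees the same expected activation magnitude."""
    rng = np.random.default_rng() if rng is None else rng
    if training:
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # 1 = keep, 0 = drop
        return x * mask
    # Inference: all units active, scaled by the keep probability.
    return x * (1.0 - p)
```

Most frameworks (for example torch.nn.Dropout) implement the mathematically equivalent "inverted" variant, which instead scales the surviving units by 1/(1 - p) during training so that inference is a plain identity.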

Why it matters

Dropout is the simplest and most widely used defense against overfitting. Without regularization, large neural networks memorize their training data instead of learning generalizable patterns. Dropout (and its cousin, weight decay) is why models can be much larger than their training sets without simply memorizing everything.

Deep Dive

The intuition: dropout trains an ensemble of sub-networks. Each training step uses a different random subset of neurons, effectively training a different architecture each time. At inference, using all neurons approximates averaging the predictions of all these sub-networks. This ensemble effect is what provides robustness — no single neuron can become a single point of failure.
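A small numerical check of that averaging claim, assuming a single linear output unit (the names and sizes are arbitrary); for a linear layer the approximation is exact in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)   # activations feeding into one linear output unit
w = rng.normal(size=8)   # that unit's weights
p = 0.5                  # drop probability

# Monte-Carlo average over many random sub-networks (training-style masks).
masked_outputs = [(x * (rng.random(8) >= p)) @ w for _ in range(100_000)]
mc_average = np.mean(masked_outputs)

# Single inference pass: all units active, scaled by the keep probability.
full_pass = (x * (1 - p)) @ w

print(mc_average, full_pass)  # the two numbers should be close
```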

Dropout in LLMs

Interestingly, many modern LLMs use little or no dropout during pre-training. At the scale of billions of parameters trained on trillions of tokens, overfitting is less of a concern because the model rarely, if ever, sees the same example twice. The training data is so vast relative to model capacity that the model is effectively always in the underfitting regime. Weight decay (L2 regularization) is more commonly used at this scale.
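As a rough illustration of that recipe (the layer, learning rate, and decay value are placeholders, not taken from any specific training run), a PyTorch setup might disable dropout in the layers and push regularization into the optimizer's weight decay:

```python
import torch
from torch import nn

# Illustrative transformer layer with dropout turned off entirely.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.0)

# Regularization comes from decoupled weight decay in the optimizer instead.
optimizer = torch.optim.AdamW(layer.parameters(), lr=3e-4, weight_decay=0.1)
```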

Variants

DropPath (stochastic depth) drops entire layers instead of individual neurons — used in Vision Transformers. DropConnect drops individual weights instead of neurons. Attention dropout drops attention weights to prevent the model from fixating on specific positions. Each variant addresses a different aspect of overfitting but shares the core idea: controlled randomness during training prevents over-specialization.
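A sketch of the DropPath idea in PyTorch (the function and its defaults are illustrative; real implementations, such as the one in the timm library, differ in details):

```python
import torch

def drop_path(x, drop_prob=0.1, training=True):
    # Stochastic depth: with probability drop_prob, skip this residual branch
    # for a given sample; surviving samples are rescaled so the expected output
    # is unchanged.
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    # One keep/drop decision per sample, broadcast over the remaining dimensions.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.empty(shape, dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
    return x * mask / keep_prob

# Typical use inside a residual block:
#   out = x + drop_path(branch(x), drop_prob=0.1, training=self.training)
```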
