Fundamentals

Scaling Laws

Neural Scaling Laws, Chinchilla Scaling
Empirical relationships showing that model performance improves predictably as you increase model size, dataset size, and compute budget. The key finding: these improvements follow power laws, smooth and predictable curves that hold across many orders of magnitude. This means you can estimate how good a model will be before spending millions training it.

Why it matters

Scaling laws are a large part of why AI progress has been so consistent. They turned model training from guesswork into engineering: companies can predict the compute budget needed for a target performance level. They also explain the AI arms race: if you know that 10x more compute yields a predictable improvement, the incentive to build bigger clusters is overwhelming.

Deep Dive

The foundational paper (Kaplan et al., 2020, OpenAI) showed that loss decreases as a power law in model parameters (N), dataset size (D), and compute (C), as long as the other two factors are not the bottleneck. The paper also concluded that model size should grow faster than data: an 8x increase in parameters called for only about a 5x increase in data to avoid a penalty. Some of these trends hold across more than seven orders of magnitude, which is remarkable for an empirical finding.
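A minimal sketch of how such a law can be fitted and extrapolated. It assumes the simplest form, loss = a * N^(-alpha), with no irreducible-loss term, and it runs on synthetic data; the constants `TRUE_A` and `TRUE_ALPHA` and all function names are illustrative assumptions, not values or code from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Evaluate the fitted law at a given model size."""
    return a * size ** (-alpha)

def size_for_target_loss(a, alpha, target_loss):
    """Invert the fitted law: model size expected to reach target_loss."""
    return (a / target_loss) ** (1.0 / alpha)

# Synthetic measurements from hypothetical small runs (NOT real results).
TRUE_A, TRUE_ALPHA = 20.0, 0.08
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)
extrapolated = predict_loss(a, alpha, 1e11)  # 100x beyond the largest run
needed = size_for_target_loss(a, alpha, extrapolated)
print(f"alpha={alpha:.3f}, predicted loss at 1e11 params={extrapolated:.3f}")
```

On a log-log plot a power law is a straight line, which is why a plain linear regression on the logarithms is enough here. Real measurements are noisy and flatten out near an irreducible loss, so a production fit would likely need a richer functional form.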

Chinchilla Changed the Game

The Chinchilla paper (Hoffmann et al., 2022, DeepMind) refined these laws with a crucial practical insight: most large models were substantially undertrained. GPT-3 had 175B parameters trained on 300B tokens (under 2 tokens per parameter), but Chinchilla showed that a 70B model trained on 1.4T tokens (about 4.7x the data, 2.5x fewer parameters) performed better. The "Chinchilla optimal" ratio is roughly 20 tokens per parameter, with parameters and tokens scaled up in equal proportion as compute grows. This shifted the industry from "bigger model" to "more data" and influenced Llama, Mistral, and most modern training runs.
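The 20 tokens-per-parameter rule of thumb can be turned into a rough budget calculator. The sketch below additionally assumes the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in this article; the function and constant names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6   # ASSUMPTION: common C ~ 6*N*D approximation

def chinchilla_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) with D = 20 * N.

    Substituting D = 20*N into C = 6*N*D gives C = 120*N**2,
    so N = sqrt(C / 120).
    """
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

def tokens_per_param(n_params, n_tokens):
    """Ratio used to judge how far a run is from the 20:1 rule of thumb."""
    return n_tokens / n_params

# Budget implied by Chinchilla's own configuration (70B params, 1.4T tokens).
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n_opt, d_opt = chinchilla_optimal(budget)
print(f"params={n_opt:.3g}, tokens={d_opt:.3g}")

# GPT-3 figures from the text: 175B parameters, 300B tokens.
print(f"GPT-3 tokens per parameter: {tokens_per_param(175e9, 300e9):.2f}")
```

Feeding in the budget implied by Chinchilla's configuration recovers 70B parameters and 1.4T tokens by construction, while the GPT-3 figures come out at under 2 tokens per parameter, far below the 20:1 ratio.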

Where Scaling Laws Break

Scaling laws predict average loss, not specific capabilities. A model might follow the predicted loss curve perfectly while suddenly gaining (or losing) specific abilities at certain scales; these are known as emergent abilities. Scaling laws also don't account for data quality: a model trained on 1T tokens of curated text can outperform one trained on 2T tokens of noisy web crawl, even though the latter uses more data. And post-training (RLHF, DPO) can dramatically change a model's usefulness without changing its pre-training loss. Scaling laws are a powerful tool, not the whole story.
