Scaling Laws: Definition & Meaning — AI Wiki

Scaling laws are empirical relationships showing that model performance, usually measured as training or test loss, improves predictably as you increase model size, dataset size, and compute budget. The key finding: these improvements follow power laws, smooth curves that hold across many orders of magnitude. This means you can estimate the loss a model will reach before you spend millions training it.

Why it matters

Scaling laws help explain why progress in large models has been so steady. They turned model training from guesswork into something closer to engineering: companies can estimate the compute budget needed to reach a target loss. They also help explain the AI arms race: if you know that 10x more compute buys a predictable reduction in loss, the incentive to build bigger clusters is overwhelming.

Deep Dive

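The foundational paper (Kaplan et al., 2020, OpenAI) showed that loss decreases as a power law in model parameters (N), dataset size (D), and compute (C), as long as the factor being scaled is not bottlenecked by the other two. The three are not freely interchangeable: the paper found that data requirements grow more slowly than model size (D proportional to roughly N^0.74, so an 8x larger model needs only about 5x more data to avoid an overfitting penalty), which pushed practitioners toward very large models trained on comparatively few tokens. Some of these trends span more than seven orders of magnitude, which is remarkable for an empirical finding.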
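The power-law form can be made concrete with a small sketch. Assuming a single-variable law L(N) = (N_c / N)^alpha, taking logs gives a straight line, log L = alpha * log N_c - alpha * log N, so a least-squares fit on log-log points recovers the exponent and lets you extrapolate to a larger model. The constants (alpha = 0.08, N_c = 1e13), the noise-free synthetic "runs", and the helper names `fit_power_law` and `predict_loss` are all illustrative assumptions, not values or code from any paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit L(N) = (n_c / N) ** alpha by least squares in log-log space.

    Returns (alpha, n_c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                   # equals -alpha
    intercept = mean_y - slope * mean_x  # equals alpha * log(n_c)
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return alpha, n_c

def predict_loss(n_params, alpha, n_c):
    """Evaluate the fitted power law at a given parameter count."""
    return (n_c / n_params) ** alpha

# Hypothetical constants used only to generate synthetic data points.
TRUE_ALPHA, TRUE_N_C = 0.08, 1e13
small_runs = [1e6, 1e7, 1e8, 1e9]  # parameter counts of cheap pilot runs
observed = [predict_loss(n, TRUE_ALPHA, TRUE_N_C) for n in small_runs]

# Fit on the small runs, then extrapolate two orders of magnitude further.
alpha, n_c = fit_power_law(small_runs, observed)
extrapolated = predict_loss(1e11, alpha, n_c)
print(f"alpha={alpha:.3f}, predicted loss at 1e11 params={extrapolated:.3f}")
```

Real training runs are noisy and the fitted exponent depends on the data and architecture, so an extrapolation like this is a planning estimate rather than a guarantee.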

Chinchilla Changed the Game

The Chinchilla paper (Hoffmann et al., 2022, DeepMind) refined these laws with a crucial practical insight: most large models of the time were substantially undertrained for their size. GPT-3 had 175B parameters trained on 300B tokens, fewer than 2 tokens per parameter, but Chinchilla, a 70B model trained on 1.4T tokens (about 4.7x the data with 2.5x fewer parameters), performed better. The paper's analysis indicates that parameters and training tokens should be scaled in roughly equal proportion as compute grows, and the "Chinchilla optimal" ratio is roughly 20 tokens per parameter. This shifted the industry's emphasis from "bigger model" to "more data" and directly influenced Llama, Mistral, and most modern training runs.
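The arithmetic behind these figures is easy to check. The sketch below assumes the common rule of thumb that training compute is about C = 6 * N * D FLOPs (an approximation not stated in this article) and treats 20 tokens per parameter as a fixed target; the function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ~ 6 * N * D rule of thumb
TOKENS_PER_PARAM = 20      # approximate Chinchilla-optimal ratio

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute under the 6 * N * D rule of thumb."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal_split(flops_budget):
    """Split a FLOP budget into (params, tokens) with D = 20 * N.

    Solves 6 * N * (20 * N) = C for N.
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

gpt3 = (175e9, 300e9)        # (parameters, training tokens)
chinchilla = (70e9, 1.4e12)

gpt3_ratio = tokens_per_param(*gpt3)
chinchilla_ratio = tokens_per_param(*chinchilla)
data_multiple = chinchilla[1] / gpt3[1]

print(f"GPT-3 tokens/param:      {gpt3_ratio:.2f}")
print(f"Chinchilla tokens/param: {chinchilla_ratio:.2f}")
print(f"Chinchilla data vs GPT-3: {data_multiple:.2f}x")

# Recover Chinchilla's shape from its own approximate compute budget.
n_opt, d_opt = compute_optimal_split(training_flops(*chinchilla))
print(f"Optimal split: {n_opt:.3g} params, {d_opt:.3g} tokens")
```

Under these assumptions, feeding Chinchilla's own compute budget back into `compute_optimal_split` returns its 70B-parameter, 1.4T-token configuration, which is just the 20:1 ratio restated as a planning formula.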

Where Scaling Laws Break

Scaling laws predict average loss, not specific capabilities. A model might follow the predicted loss curve perfectly while abruptly gaining (or losing) specific abilities at certain scales; these are often called emergent abilities. In their basic form, scaling laws also don't account for data quality: a model trained on 1T tokens of curated text can outperform one trained on 2T tokens of noisy web crawl, even though the latter sees more data. And post-training (RLHF, DPO) can dramatically change a model's usefulness without changing its pre-training loss. Scaling laws are a powerful tool, not the whole story.
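One toy way to see how a smooth loss curve can coexist with abrupt-looking task scores: suppose a task counts as solved only if every one of 30 answer tokens is correct, and each token is independently correct with probability exp(-loss). Both that independence assumption and the power-law constants below are hypothetical, chosen only for illustration; this is a sketch of the arithmetic, not an explanation of why real emergent abilities appear.

```python
import math

def loss_at(n_params, alpha=0.3, n_c=1e7):
    """Hypothetical per-token loss following L(N) = (n_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

def exact_match_rate(loss, answer_len):
    """Probability of getting all answer_len tokens right, assuming each
    token is independently correct with probability exp(-loss)."""
    return math.exp(-loss * answer_len)

sizes = [1e7, 1e8, 1e9, 1e10, 1e11, 1e12]
rows = [(n, loss_at(n), exact_match_rate(loss_at(n), 30)) for n in sizes]

# Loss shrinks by the same factor at every 10x step, while the
# all-or-nothing task score stays near zero and then climbs late.
for n, loss, acc in rows:
    print(f"N={n:.0e}  loss={loss:.3f}  30-token exact match={acc:.4f}")
```

In this toy setup the loss falls by a constant factor per 10x in parameters, yet the exact-match score is negligible for the smaller models and only becomes substantial at the largest sizes, so a metric built on loss and a metric built on task success can tell very different stories about the same models.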
