Fundamentals

Scaling Laws

Neural Scaling Laws, Chinchilla Scaling
Empirical relationships showing that model performance improves predictably as you increase model size, dataset size, and compute budget. The key finding: these improvements follow power laws, smooth and predictable curves that hold across many orders of magnitude. That means you can estimate how good a model will be before spending millions to train it.

Why It Matters

Scaling laws are why AI progress has been so consistent. They turned model training from guesswork into engineering: companies can predict the compute budget needed to reach a target performance level. They also explain the AI arms race: if you know that 10x more compute buys a predictable improvement, the incentive to build ever-bigger clusters is overwhelming.

Deep Dive

The foundational paper (Kaplan et al., 2020, OpenAI) showed that loss decreases as a power law with model parameters (N), dataset size (D), and compute (C), and that these three factors are roughly interchangeable within limits: scale up the parameters and you need to grow the data alongside them to keep the loss falling. The relationship holds across seven or more orders of magnitude, which is remarkable for an empirical finding.
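The power-law shape is easy to sketch numerically. A minimal illustration for the parameter-count term alone, using the approximate constants reported in the Kaplan et al. fits (these values are assumptions for illustration; exact fits depend on tokenization and training setup):

```python
# Kaplan-style power law for loss as a function of parameter count:
#   L(N) = (N_c / N) ** alpha_N
# Constants below are the approximate published fits, used here
# purely as an illustration of the functional form.
N_C = 8.8e13      # critical parameter count from the Kaplan et al. fit
ALPHA_N = 0.076   # power-law exponent for model size

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) from parameter count alone."""
    return (N_C / n_params) ** ALPHA_N

# Smooth, predictable improvement across orders of magnitude:
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}: predicted loss ~ {loss_from_params(n):.3f}")
```

The point of the exercise: because the curve is a straight line on a log-log plot, a handful of cheap small-scale runs let you extrapolate the loss of a model a thousand times larger before committing the compute.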

Chinchilla Changed the Game

The Chinchilla paper (Hoffmann et al., 2022, DeepMind) refined these laws with a crucial practical insight: most models were dramatically undertrained. GPT-3 had 175B parameters trained on 300B tokens, but Chinchilla showed that a 70B model trained on 1.4T tokens (nearly 5x the data on 2.5x fewer parameters) performed better. The "Chinchilla optimal" ratio is roughly 20 tokens per parameter. This shifted the industry from "bigger model" to "more data" and directly influenced Llama, Mistral, and most modern training runs.
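The compute-optimal split can be worked out from the common rule of thumb C ≈ 6·N·D training FLOPs (an approximation not stated in this entry, used here as an assumption). Combined with the Chinchilla ratio D ≈ 20·N, a given budget pins down both sizes:

```python
import math

TOKENS_PER_PARAM = 20       # rough "Chinchilla optimal" ratio
FLOPS_PER_PARAM_TOKEN = 6   # rule-of-thumb approximation: C ~ 6 * N * D

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) spending the budget at ~20 tokens/param.

    From C = 6 * N * D and D = 20 * N it follows that N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Roughly GPT-3's training budget (~3.1e23 FLOPs):
n, d = chinchilla_optimal(3.1e23)
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.2f}T tokens")
```

Run on a GPT-3-scale budget, this sketch suggests a model of around 50B parameters on roughly a trillion tokens, which is exactly the direction Chinchilla pushed: far smaller than 175B, trained on far more data.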

Where Scaling Laws Break

Scaling laws predict average loss, not specific capabilities. A model might follow the predicted loss curve perfectly while suddenly gaining (or losing) specific abilities at certain scales — these are emergent abilities. Scaling laws also don't account for data quality: a model trained on 1T tokens of curated text outperforms one trained on 2T tokens of noisy web crawl, even though the latter uses more data. And post-training (RLHF, DPO) can dramatically change a model's usefulness without changing its pre-training loss. Scaling laws are a powerful tool, not the whole story.
