Fundamentals

Scaling Laws

Neural Scaling Laws, Chinchilla Scaling
Empirical relationships showing that model performance improves predictably as you increase model size, dataset size, and compute budget. The key finding: these improvements follow power laws, smooth and predictable curves that hold across many orders of magnitude. This means you can estimate how good a model will be before spending millions training it.

Why It Matters

Scaling laws are a large part of why AI progress has been so consistent. They turned model training from guesswork into engineering: companies can predict the compute budget needed for a target level of performance. They also explain the AI arms race: if you know that 10x more compute yields a predictable improvement, the incentive to build bigger clusters is overwhelming.

Deep Dive

The foundational paper (Kaplan et al., 2020, OpenAI) showed that loss decreases as a power law in model parameters (N), dataset size (D), and compute (C), as long as the other two are not the bottleneck. The factors have to be scaled together: grow the model without enough data and the gains flatten out. Kaplan et al. estimated that data needs grow more slowly than model size (roughly 5x more data for 8x more parameters), a ratio that later work revised. The trends hold across more than seven orders of magnitude, which is remarkable for an empirical finding.
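To make the "estimate before you train" idea concrete, here is a minimal sketch of fitting a power law to a few small runs and extrapolating to a larger one. It assumes the single-variable form L(N) = (N_c / N)^alpha and fits it by least squares in log-log space. The constants `ASSUMED_ALPHA` and `ASSUMED_N_C` and the model sizes are made up for illustration and are not taken from any paper; a real fit would use measured losses.

```python
import math

def fit_power_law(sizes, losses):
    """Fit L(N) = (n_c / N) ** alpha via linear regression on log N vs log L."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # log L = intercept + slope * log N
    intercept = mean_y - slope * mean_x
    alpha = -slope                       # a power law is a straight line in log-log space
    n_c = math.exp(intercept / alpha)    # intercept = alpha * log(n_c)
    return alpha, n_c

def predict_loss(n_params, alpha, n_c):
    return (n_c / n_params) ** alpha

# Hypothetical constants used only to generate synthetic "measurements".
ASSUMED_ALPHA = 0.08
ASSUMED_N_C = 1e14

small_models = [1e6, 1e7, 1e8, 1e9]
measured = [predict_loss(n, ASSUMED_ALPHA, ASSUMED_N_C) for n in small_models]

alpha, n_c = fit_power_law(small_models, measured)
predicted_100b = predict_loss(1e11, alpha, n_c)  # extrapolate two orders of magnitude
print(f"alpha={alpha:.3f}, predicted loss at 100B params={predicted_100b:.3f}")
```

With noisy real measurements the fit would not be exact, and extrapolating far beyond the measured range carries more uncertainty than this clean synthetic example suggests.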

Chinchilla Changed the Game

The Chinchilla paper (Hoffmann et al., 2022, DeepMind) refined these laws with a crucial practical insight: most large models were dramatically undertrained. GPT-3 had 175B parameters trained on 300B tokens (under 2 tokens per parameter), but Chinchilla showed that a 70B model trained on 1.4T tokens (about 4.7x the data, 2.5x fewer parameters) performed better. The "Chinchilla optimal" ratio is roughly 20 tokens per parameter. This shifted the industry's focus from "bigger model" to "more data" and directly influenced Llama, Mistral, and most modern training runs.
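The arithmetic behind the 20-tokens-per-parameter rule can be sketched in a few lines. This example assumes the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in the text above; the function names are hypothetical. Given a compute budget and the ratio D = 20·N, the budget fixes both the model size and the token count.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ≈ 6 * N * D, a widely used approximation
TOKENS_PER_PARAM = 20.0      # rule of thumb quoted in the text

def training_flops(n_params, n_tokens):
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Solve C = 6 * N * (ratio * N) for N, then derive D = ratio * N."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget implied by the Chinchilla configuration in the text (70B params, 1.4T tokens).
chinchilla_flops = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal_split(chinchilla_flops)

# GPT-3 as described in the text: 175B params on 300B tokens.
gpt3_tokens_per_param = 300e9 / 175e9

print(f"budget={chinchilla_flops:.2e} FLOPs -> N={n_opt:.2e}, D={d_opt:.2e}")
print(f"GPT-3 tokens per parameter: {gpt3_tokens_per_param:.2f}")
```

Feeding the Chinchilla budget back through the rule recovers the 70B / 1.4T split, which is just a consistency check on the ratio, not a reproduction of the paper's fitting procedure.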

Where Scaling Laws Break

Scaling laws predict average loss, not specific capabilities. A model might follow the predicted loss curve perfectly while suddenly gaining (or losing) specific abilities at certain scales; these are often called emergent abilities. Scaling laws also don't account for data quality: a model trained on 1T tokens of curated text can outperform one trained on 2T tokens of noisy web crawl, even though the latter uses more data. And post-training (RLHF, DPO) can dramatically change a model's usefulness without changing its pre-training loss. Scaling laws are a powerful tool, not the whole story.
