
Scaling Laws

Neural Scaling Laws, Chinchilla Scaling
Empirical relationships showing that model performance improves predictably as you increase model size, dataset size, and compute budget. The key finding: these improvements follow power laws, smooth and predictable curves that hold across many orders of magnitude. This means you can estimate how good a model will be before spending millions on training it.

Why It Matters

Scaling laws are why AI progress has been so consistent. They turned model training from guesswork into engineering: companies can forecast the compute budget needed to reach a target performance level. They also explain the AI arms race: if you know that 10x compute yields predictable gains, the incentive to build ever-bigger clusters becomes irresistible.

Deep Dive

The foundational paper (Kaplan et al., 2020, OpenAI) showed that loss decreases as a power law with model parameters (N), dataset size (D), and compute (C), and that these three factors are roughly interchangeable within limits: double the parameters and you need to roughly double the data to keep pace. The relationship holds across seven or more orders of magnitude, which is remarkable for an empirical finding.
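A minimal sketch of what "loss follows a power law in N" means in practice. The constants below are the approximate fitted values reported by Kaplan et al. for parameter scaling; they are illustrative, not something you should reuse for a real budget:

```python
# Kaplan-style power law: loss as a function of parameter count alone.
# ALPHA_N and N_C are approximate values from the 2020 paper's fit.
ALPHA_N = 0.076      # power-law exponent for parameter count
N_C = 8.8e13         # fitted "critical" parameter scale

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) from parameter count alone."""
    return (N_C / n_params) ** ALPHA_N

# The power-law form means doubling parameters shrinks loss by the same
# *ratio* at any scale, which is exactly what makes extrapolation possible:
ratio_small = loss_from_params(2e8) / loss_from_params(1e8)
ratio_large = loss_from_params(2e11) / loss_from_params(1e11)
# both ratios equal 2 ** -ALPHA_N, i.e. ~5% loss reduction per doubling
```

On a log-log plot this is a straight line, which is why practitioners can fit it on small runs and extrapolate to large ones.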

Chinchilla Changed the Game

The Chinchilla paper (Hoffmann et al., 2022, DeepMind) refined these laws with a crucial practical insight: most models were dramatically undertrained. GPT-3 had 175B parameters trained on 300B tokens, but Chinchilla showed that a 70B model trained on 1.4T tokens (20x the data, 2.5x fewer parameters) performed better. The "Chinchilla optimal" ratio is roughly 20 tokens per parameter. This shifted the industry from "bigger model" to "more data" and directly influenced Llama, Mistral, and most modern training runs.
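The ratio above turns into a simple back-of-envelope calculation. A sketch assuming the ~20 tokens/parameter heuristic and the standard C ≈ 6·N·D approximation for training FLOPs (both are rough rules of thumb, not exact values from the paper's fits):

```python
# Chinchilla-style budget check: how much data "balances" a given model size?
TOKENS_PER_PARAM = 20  # approximate Chinchilla-optimal ratio

def chinchilla_tokens(n_params: float) -> float:
    """Data budget (tokens) that roughly balances a model of n_params."""
    return TOKENS_PER_PARAM * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Common back-of-envelope: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# Chinchilla itself: 70B parameters -> ~1.4T tokens, matching the paper.
tokens = chinchilla_tokens(70e9)   # 1.4e12 tokens
flops = train_flops(70e9, tokens)  # ~5.9e23 FLOPs

# GPT-3 by the same rule: 175B parameters would "want" ~3.5T tokens,
# more than 10x the ~300B it was actually trained on.
gpt3_tokens = chinchilla_tokens(175e9)
```

Running the same check on GPT-3's 175B/300B setup is exactly how the paper's "dramatically undertrained" conclusion falls out.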

Where Scaling Laws Break

Scaling laws predict average loss, not specific capabilities. A model might follow the predicted loss curve perfectly while suddenly gaining (or losing) specific abilities at certain scales — these are emergent abilities. Scaling laws also don't account for data quality: a model trained on 1T tokens of curated text outperforms one trained on 2T tokens of noisy web crawl, even though the latter uses more data. And post-training (RLHF, DPO) can dramatically change a model's usefulness without changing its pre-training loss. Scaling laws are a powerful tool, not the whole story.
