Basics

Scaling Laws

Neural Scaling Laws, Chinchilla Scaling
Empirical relationships showing that model performance improves predictably as you increase model size, dataset size, and compute budget. Key finding: these improvements follow power laws — smooth, predictable curves that hold across many orders of magnitude. This means you can estimate how good a model will be before spending millions to train it.

Why It Matters

Scaling laws are why AI progress has been so consistent. They turn model training from guesswork into engineering — companies can forecast the compute budget needed to hit a target performance level. They also explain the AI arms race: if you know that 10x the compute yields a predictable improvement, the incentive to build bigger clusters is irresistible.

Deep Dive

The foundational paper (Kaplan et al., 2020, OpenAI) showed that loss decreases as a power law with model parameters (N), dataset size (D), and compute (C), and that these three factors are roughly interchangeable within limits. Double the parameters and you need roughly twice the data to keep pace. The relationship holds across 7+ orders of magnitude, which is remarkable for an empirical finding.
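The power-law shape is easy to see numerically. A minimal sketch, using illustrative constants in the spirit of Kaplan et al. (2020) — the specific values of `N_C` and `ALPHA_N` here are assumptions for demonstration, not quoted from this article:

```python
# Kaplan-style power law for loss vs. parameter count:
#   L(N) = (N_c / N) ** alpha_N
# holding data and compute non-limiting.
N_C = 8.8e13      # assumed critical parameter scale (illustrative)
ALPHA_N = 0.076   # assumed power-law exponent (illustrative)

def loss_from_params(n_params: float) -> float:
    """Predicted pre-training loss as a pure power law in N."""
    return (N_C / n_params) ** ALPHA_N

# Each 10x in parameters gives a small, predictable loss reduction:
for n in [1e9, 1e10, 1e11]:
    print(f"N={n:.0e}  predicted loss={loss_from_params(n):.3f}")
```

On a log-log plot this is a straight line, which is exactly what makes extrapolating to untrained scales possible.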

Chinchilla Changed the Game

The Chinchilla paper (Hoffmann et al., 2022, DeepMind) refined these laws with a crucial practical insight: most models were dramatically undertrained. GPT-3 had 175B parameters trained on 300B tokens, but Chinchilla showed that a 70B model trained on 1.4T tokens (20x the data, 2.5x fewer parameters) performed better. The "Chinchilla optimal" ratio is roughly 20 tokens per parameter. This shifted the industry from "bigger model" to "more data" and directly influenced Llama, Mistral, and most modern training runs.
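The 20-tokens-per-parameter rule can be turned into a budget calculator. A sketch, assuming the common approximation that training compute is C ≈ 6·N·D FLOPs (the function name is mine, not from any library):

```python
import math

# Chinchilla-optimal split of a fixed compute budget, using
# C ≈ 6 * N * D (FLOPs) and the rough ratio D ≈ 20 * N.
TOKENS_PER_PARAM = 20

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Given a compute budget in FLOPs, return (params, tokens)."""
    # C = 6 * N * (20 * N)  =>  C = 120 * N^2  =>  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# Chinchilla itself (~70B params, ~1.4T tokens) corresponds to
# roughly 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs:
n, d = chinchilla_optimal(5.9e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```

Plugging in GPT-3's budget the same way shows why it counts as undertrained: at its compute level, the optimal split calls for far fewer parameters and far more tokens than 175B on 300B.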

Where Scaling Laws Break

Scaling laws predict average loss, not specific capabilities. A model might follow the predicted loss curve perfectly while suddenly gaining (or losing) specific abilities at certain scales — these are emergent abilities. Scaling laws also don't account for data quality: a model trained on 1T tokens of curated text outperforms one trained on 2T tokens of noisy web crawl, even though the latter uses more data. And post-training (RLHF, DPO) can dramatically change a model's usefulness without changing its pre-training loss. Scaling laws are a powerful tool, not the whole story.
