
FLOPs

Floating Point Operations, FLOP/s, Compute
Floating Point Operations: the standard unit of computational work in AI. Training a model takes a certain number of FLOPs (total operations). Hardware is rated in FLOP/s (operations per second). An H100 GPU delivers roughly 2,000 TFLOP/s at FP16 (2 quadrillion operations per second). GPT-4's training is estimated at ~10^25 FLOPs, a number too large to grasp intuitively.
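
To make that scale concrete, here is a minimal back-of-the-envelope sketch in Python. The total FLOPs and peak rate come from the figures above; the 40% utilization and the 10,000-GPU cluster are illustrative assumptions, not reported details of GPT-4's training run.

```python
# Back-of-the-envelope scale check using the figures above.
# Utilization and cluster size are illustrative assumptions.
total_flops = 1e25            # estimated GPT-4 training compute
peak_per_gpu = 2e15           # H100 FP16, ~2,000 TFLOP/s
utilization = 0.4             # assumed achieved fraction of peak

effective_per_gpu = peak_per_gpu * utilization           # FLOP/s of useful work
single_gpu_years = total_flops / effective_per_gpu / (3600 * 24 * 365)
cluster_days = total_flops / (effective_per_gpu * 10_000) / 86_400

print(f"One H100: ~{single_gpu_years:,.0f} years")       # ~400 years
print(f"10,000 H100s: ~{cluster_days:.0f} days")         # ~14 days
```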

Why It Matters

FLOPs are the currency of AI compute. Scaling laws are expressed in FLOPs. Training budgets are measured in FLOPs. GPUs are compared by FLOP/s. Understanding FLOPs lets you estimate training costs, compare hardware, and see why AI progress is so tightly tied to compute scaling. When people say "scaling compute," they mean spending more FLOPs.

Deep Dive

A useful approximation for Transformer training FLOPs: C ≈ 6 · N · D, where N is parameter count and D is tokens processed. The 6 comes from the forward pass (2 FLOPs per parameter per token, since a multiply-add counts as 2 operations) plus the backward pass (roughly 2x the forward). Training a 7B model on 1T tokens: 6 × 7×10^9 × 10^12 = 4.2×10^22 FLOPs. At 50% GPU utilization on H100s (~1000 TFLOP/s effective), that takes about 12,000 GPU-hours, or roughly 490 GPU-days.
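
A minimal sketch of that arithmetic, assuming the 6 · N · D approximation for a dense Transformer:

```python
# C ≈ 6 * N * D: 2 FLOPs per parameter per token forward (multiply + add),
# plus roughly twice that again for the backward pass.
def transformer_training_flops(n_params: float, n_tokens: float) -> float:
    forward = 2 * n_params * n_tokens    # multiply-add counted as 2 ops
    backward = 2 * forward               # backward pass ≈ 2x forward
    return forward + backward            # total ≈ 6 * N * D

c = transformer_training_flops(7e9, 1e12)   # 7B params on 1T tokens
print(f"{c:.1e} FLOPs")                      # 4.2e+22
```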

FLOPs vs. FLOP/s vs. GPU-Hours

FLOPs (without /s) is total work. FLOP/s is the rate. GPU-hours is time × hardware. They relate: GPU-hours = FLOPs / (FLOP/s × utilization × 3600). In practice, GPU utilization for LLM training is 30–60% (limited by communication, memory operations, and pipeline bubbles). This means an H100's theoretical 2000 TFLOP/s translates to 600–1200 TFLOP/s of actual useful work. Cost follows: at $2/GPU-hour, training that 7B model costs roughly $23,000.
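
A small sketch of the conversion, sweeping the 30–60% utilization range quoted above and using the example $2/GPU-hour price:

```python
# GPU-hours and cost from total FLOPs, peak hardware rate, and utilization.
def gpu_hours(total_flops: float, peak_flops_per_sec: float, utilization: float) -> float:
    return total_flops / (peak_flops_per_sec * utilization) / 3600

c = 4.2e22                                   # 7B model on 1T tokens (from above)
peak = 2e15                                  # H100 FP16 peak, ~2,000 TFLOP/s
for util in (0.3, 0.5, 0.6):
    hours = gpu_hours(c, peak, util)
    print(f"utilization {util:.0%}: {hours:,.0f} GPU-hours, ~${2 * hours:,.0f}")
```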

Inference FLOPs

Inference FLOPs per token ≈ 2N (one forward pass). A 70B model: ~140 billion FLOPs per token. At 1000 TFLOP/s effective, that's 0.14ms per token — theoretically 7000 tokens/second. In practice, inference is usually memory-bandwidth-bound (reading 140GB of weights per token at 3TB/s takes 47ms), not compute-bound. This is the memory wall: the GPU can compute faster than it can read the model weights.
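
A minimal sketch of the per-token comparison, assuming FP16 weights (2 bytes per parameter) and the throughput and bandwidth figures from the text:

```python
# Per-token inference: compute time vs. time to read the weights once.
# Whichever is larger bounds single-stream decoding speed (the memory wall).
n_params = 70e9                 # 70B model
bytes_per_param = 2             # FP16 weights -> 140 GB total
compute_rate = 1e15             # ~1000 TFLOP/s effective
mem_bandwidth = 3e12            # ~3 TB/s HBM

compute_time = 2 * n_params / compute_rate                # ~2N FLOPs per token
memory_time = n_params * bytes_per_param / mem_bandwidth  # one full weight read

print(f"compute: {compute_time*1e3:.2f} ms/token")   # ~0.14 ms
print(f"memory:  {memory_time*1e3:.1f} ms/token")    # ~47 ms -> bandwidth-bound
```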
