Fundamentals

FLOPs

Floating Point Operations, FLOP/s, Compute
Floating Point Operations: the standard measure of computational work in AI. Training a model requires some total number of FLOPs (operations); hardware is rated in FLOP/s (operations per second). An H100 GPU can deliver ~2,000 TFLOP/s (2 quadrillion operations per second) of peak FP16 throughput. GPT-4's training run is estimated at ~10^25 FLOPs, a number so large it is hard to grasp.
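
To get a feel for that scale, here is a minimal Python sketch: how long ~10^25 FLOPs would take on a single H100. The GPT-4 estimate and H100 figure are the ones quoted above; the 50% utilization is an assumption.

```python
# Back-of-envelope: how long would ~1e25 FLOPs take on one H100?
GPT4_TRAINING_FLOPS = 1e25   # rough estimate cited above
H100_PEAK_FLOPS = 2e15       # ~2,000 TFLOP/s peak FP16
UTILIZATION = 0.5            # assumed fraction of peak actually achieved

effective = H100_PEAK_FLOPS * UTILIZATION   # ~1e15 FLOP/s
seconds = GPT4_TRAINING_FLOPS / effective
years = seconds / (3600 * 24 * 365)
print(f"{years:,.0f} years on a single GPU")  # ~317 years
```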

Why it matters

FLOPs are the currency of AI compute. Scaling laws are expressed in FLOPs. Training budgets are measured in FLOPs. GPU comparisons use FLOP/s. Understanding FLOPs helps you estimate training costs, compare hardware, and see why AI progress is so tightly tied to compute scaling. When people say "scale compute," they mean spending more FLOPs.

Deep Dive

A useful approximation for Transformer training FLOPs: C ≈ 6 · N · D, where N is the parameter count and D is the number of tokens processed. The 6 comes from the forward pass (2 FLOPs per parameter per token, since a multiply-add counts as 2 operations) plus the backward pass (roughly 2× the forward). Training a 7B model on 1T tokens: 6 × 7×10^9 × 10^12 = 4.2×10^22 FLOPs. At 50% GPU utilization on H100s (~1,000 TFLOP/s effective), that takes about 12,000 GPU-hours.
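
A minimal sketch of the 6 · N · D rule, reproducing the 7B-on-1T-tokens example (pure arithmetic, nothing assumed beyond the approximation itself):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ≈ 6 * N * D.
    2 FLOPs/param/token for the forward pass + ~4 for the backward."""
    return 6.0 * n_params * n_tokens

c = training_flops(7e9, 1e12)   # 7B parameters, 1T tokens
print(f"{c:.1e} FLOPs")         # 4.2e+22
```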

FLOPs vs. FLOP/s vs. GPU-Hours

FLOPs (without the /s) is total work. FLOP/s is a rate. GPU-hours is time × hardware. They relate as GPU-hours = FLOPs / (peak FLOP/s × utilization × 3,600). In practice, GPU utilization for LLM training is 30–60% (limited by communication, memory operations, and pipeline bubbles), so an H100's theoretical 2,000 TFLOP/s translates to 600–1,200 TFLOP/s of actual useful work. Cost follows: at $2/GPU-hour, the 7B training run above comes to roughly $23,000.
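
The same conversion as a short sketch; the 50% utilization and the $2/GPU-hour price are assumed figures taken from the surrounding text:

```python
def gpu_hours(total_flops: float, peak_flops: float, utilization: float) -> float:
    """GPU-hours = FLOPs / (peak FLOP/s * utilization * 3600 s/h)."""
    return total_flops / (peak_flops * utilization * 3600)

hours = gpu_hours(4.2e22, 2e15, 0.5)   # the 7B/1T-token run above
cost = hours * 2.0                     # assumed $2 per GPU-hour
print(f"{hours:,.0f} GPU-hours -> ~${cost:,.0f}")  # 11,667 GPU-hours -> ~$23,333
```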

Inference FLOPs

Inference FLOPs per token ≈ 2N (one forward pass). For a 70B model that is ~140 billion FLOPs per token. At 1,000 TFLOP/s effective, that's 0.14 ms per token, or theoretically ~7,000 tokens/second. In practice, inference is usually memory-bandwidth-bound rather than compute-bound: at batch size 1, reading all 140 GB of FP16 weights (70B parameters × 2 bytes) for each token at 3 TB/s takes ~47 ms. This is the memory wall: the GPU can compute faster than it can read the model weights.
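
A sketch comparing the two per-token bounds for the 70B example, assuming FP16 weights (2 bytes per parameter) and the effective compute and bandwidth figures quoted above:

```python
N = 70e9             # parameters
BYTES_PER_PARAM = 2  # FP16
COMPUTE = 1e15       # ~1,000 TFLOP/s effective
BANDWIDTH = 3e12     # ~3 TB/s HBM

compute_s = 2 * N / COMPUTE                  # 2N FLOPs per token -> 0.14 ms
memory_s = N * BYTES_PER_PARAM / BANDWIDTH   # full weight read   -> ~46.7 ms
print(f"compute bound: {compute_s * 1e3:.2f} ms/token")
print(f"memory bound:  {memory_s * 1e3:.1f} ms/token")
# At batch size 1 the slower bound dominates: memory, by roughly 300x.
```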
