A useful approximation for Transformer training FLOPs: C ≈ 6 · N · D, where N is parameter count and D is tokens processed. The 6 breaks down as 2 FLOPs per parameter per token for the forward pass (a multiply-add counts as 2 operations) plus roughly 4 for the backward pass (about 2× the forward). Training a 7B model on 1T tokens: 6 × 7×10^9 × 10^12 = 4.2×10^22 FLOPs. At 50% GPU utilization on H100s (~1000 TFLOP/s effective), that is 4.2×10^22 / 10^15 = 4.2×10^7 seconds of GPU time, or about 11,700 GPU-hours.
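The estimate above can be sketched as a few lines of arithmetic (the utilization and throughput figures are the assumptions stated in the text, not measured values):

```python
# Back-of-envelope training compute, using the C ≈ 6·N·D approximation.
N = 7e9                     # parameters (7B model)
D = 1e12                    # training tokens (1T)
flops = 6 * N * D           # total training FLOPs ≈ 4.2e22

effective_flops_per_s = 1000e12   # assumed: H100 at ~50% utilization → 1000 TFLOP/s
seconds = flops / effective_flops_per_s
gpu_hours = seconds / 3600

print(f"{flops:.1e} FLOPs, {gpu_hours:,.0f} GPU-hours")
```

Running this reproduces the 4.2×10^22 FLOPs figure and shows why the GPU-hour count lands in the tens of thousands, not hundreds.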
FLOPs (without /s) is total work. FLOP/s is the rate. GPU-hours is time × hardware. They relate: GPU-hours = FLOPs / (FLOP/s × utilization). In practice, GPU utilization for LLM training is 30–60% (limited by communication, memory operations, and pipeline bubbles). This means an H100's theoretical 2000 TFLOP/s translates to 600–1200 TFLOP/s of actual useful work. Cost follows: at $2/GPU-hour, training that 7B model costs roughly $23,000.
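The relation between total FLOPs, hardware rate, utilization, and dollars can be packaged as one small helper (a sketch; the function name and parameters are illustrative, and the inputs are the H100 and pricing assumptions from the text):

```python
def training_cost(flops, peak_flops_per_s, utilization, dollars_per_gpu_hour):
    """GPU-hours = FLOPs / (peak FLOP/s × utilization); cost scales linearly."""
    gpu_hours = flops / (peak_flops_per_s * utilization) / 3600
    return gpu_hours, gpu_hours * dollars_per_gpu_hour

# 7B model on 1T tokens, H100 at 2000 TFLOP/s peak, 50% utilization, $2/GPU-hour
hours, cost = training_cost(4.2e22, 2000e12, 0.50, 2.0)
print(f"{hours:,.0f} GPU-hours, ${cost:,.0f}")
```

Varying the utilization argument between 0.3 and 0.6 reproduces the 600–1200 TFLOP/s effective range and the corresponding cost spread.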
Inference FLOPs per token ≈ 2N (one forward pass). A 70B model: ~140 billion FLOPs per token. At 1000 TFLOP/s effective, that's 0.14ms per token — theoretically 7000 tokens/second. In practice, inference is usually memory-bandwidth-bound, not compute-bound: reading 140GB of FP16 weights (2 bytes per parameter) on every decoding step at 3TB/s takes 47ms. This is the memory wall: the GPU can compute faster than it can read the model weights.
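A roofline-style comparison makes the memory wall concrete: per-token latency is bounded by whichever is slower, the compute or the weight read. A minimal sketch, assuming FP16 weights, batch size 1, and the throughput and bandwidth figures used above:

```python
# Compute-bound vs. memory-bound per-token latency for a 70B model.
N = 70e9                          # parameters
bytes_per_param = 2               # FP16
flops_per_token = 2 * N           # one forward pass ≈ 140 GFLOPs

compute_s = flops_per_token / 1000e12          # 1000 TFLOP/s effective
memory_s = (N * bytes_per_param) / 3e12        # read 140 GB at 3 TB/s HBM bandwidth
latency_s = max(compute_s, memory_s)           # slower bound wins

print(f"compute: {compute_s*1e3:.2f} ms, memory: {memory_s*1e3:.1f} ms")
```

The memory term dominates by over 300×, which is why batching (amortizing one weight read across many tokens) is the standard way to recover inference throughput.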