Core Concepts

FLOPs

Floating Point Operations, FLOP/s, Compute
Floating Point Operations — the standard measure of computational work in AI. Training a model requires a certain number of FLOPs (total operations). Hardware is rated in FLOP/s (operations per second). A single H100 GPU can perform ~2,000 TFLOP/s (2 quadrillion operations per second) in FP16. GPT-4's training is estimated at ~10^25 FLOPs — a number so large it is hard to grasp.

Why it matters

FLOPs are the currency of AI compute. Scaling laws are expressed in FLOPs. Training budgets are measured in FLOPs. GPU comparisons use FLOP/s. Understanding FLOPs helps you estimate training costs, compare hardware, and see why AI progress is so closely tied to compute scaling. When people talk about "scaling compute," they mean spending more FLOPs.

Deep Dive

A useful approximation for Transformer training FLOPs: C ≈ 6 · N · D, where N is parameter count and D is tokens processed. The 6 comes from the forward pass (2× — a multiply-add counts as 2 operations) plus the backward pass (roughly 2× the forward). Training a 7B model on 1T tokens: 6 × 7×10^9 × 10^12 = 4.2×10^22 FLOPs. At 50% GPU utilization on H100s (~1,000 TFLOP/s effective), that takes about 12,000 GPU-hours (≈500 GPU-days).
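This arithmetic can be sketched in a few lines of Python (numbers from the text; the ~50% utilization figure is an assumption, not a measured value):

```python
# Training FLOPs via the C ≈ 6·N·D approximation.
N = 7e9        # parameters (7B model)
D = 1e12       # training tokens (1T)
C = 6 * N * D  # total training FLOPs
print(f"{C:.1e} FLOPs")  # 4.2e+22 FLOPs

# Convert to GPU-hours at an assumed effective throughput:
# H100 peak ~2,000 TFLOP/s FP16 at ~50% utilization → ~1e15 FLOP/s.
effective_flops_per_s = 1e15
gpu_hours = C / effective_flops_per_s / 3600
print(f"{gpu_hours:,.0f} GPU-hours")  # 11,667 GPU-hours
```

Dividing by 24 recovers the ~500 GPU-day figure.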

FLOPs vs. FLOP/s vs. GPU-Hours

FLOPs (without /s) is total work. FLOP/s is the rate. GPU-hours is time × hardware. They relate: GPU-hours = FLOPs / (FLOP/s × utilization). In practice, GPU utilization for LLM training is 30–60% (limited by communication, memory operations, and pipeline bubbles). This means an H100's theoretical 2,000 TFLOP/s translates to 600–1,200 TFLOP/s of actual useful work. Cost follows: at $2/GPU-hour, the ~12,000 GPU-hours needed to train that 7B model come to roughly $23,000.

Inference FLOPs

Inference FLOPs per token ≈ 2N (one forward pass). A 70B model: ~140 billion FLOPs per token. At 1000 TFLOP/s effective, that's 0.14ms per token — theoretically 7000 tokens/second. In practice, inference is usually memory-bandwidth-bound (reading 140GB of weights per token at 3TB/s takes 47ms), not compute-bound. This is the memory wall: the GPU can compute faster than it can read the model weights.
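A minimal sketch of the two limits, assuming FP16 weights (2 bytes/parameter) and the ~1,000 TFLOP/s effective compute and 3 TB/s memory bandwidth figures above:

```python
# Per-token inference: compute time vs. weight-read time (the memory wall).
n_params = 70e9
flops_per_token = 2 * n_params          # ≈ 1.4e11 FLOPs (one forward pass)
compute_s = flops_per_token / 1e15      # at 1,000 TFLOP/s effective → 0.14 ms

weight_bytes = n_params * 2             # FP16: 2 bytes per parameter → 140 GB
memory_s = weight_bytes / 3e12          # at 3 TB/s HBM bandwidth → ~47 ms

# memory_s >> compute_s: single-stream decoding is bandwidth-bound,
# which is why batching (amortizing the weight read) raises throughput.
print(f"compute: {compute_s*1e3:.2f} ms, memory: {memory_s*1e3:.1f} ms")
```

The ratio memory_s / compute_s (~330×) is the gap the memory wall describes.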
