
Mixed Precision Training

FP16, BF16, Half Precision
When training a neural network, most computations run in a lower-precision number format (16-bit instead of 32-bit), while critical operations stay in full precision. This effectively doubles a GPU's usable memory capacity and compute throughput with minimal impact on model quality. BF16 (bfloat16) is the standard for LLM training; FP16 is commonly used for inference.

Why It Matters

Mixed precision is the reason we can train models as large as today's. A 70B-parameter model in FP32 needs 280 GB for the weights alone, which no single GPU can hold. In BF16 it needs only 140 GB, which can be sharded across a few GPUs. Mixed precision effectively doubled the AI industry's compute for free, simply by using smarter number formats.
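A quick back-of-the-envelope check of those figures (weights only; optimizer state and activations are ignored here):

```python
params = 70e9                      # 70B parameters
print(params * 4 / 1e9, "GB")      # FP32: 4 bytes per parameter -> 280.0 GB
print(params * 2 / 1e9, "GB")      # BF16: 2 bytes per parameter -> 140.0 GB
```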

Deep Dive

The key insight: most neural network computations don't need 32 bits of precision. The weights, activations, and gradients can be represented in 16 bits without meaningful quality loss. But some operations (loss computation, weight updates) need higher precision to avoid numerical instability. Mixed precision keeps a master copy of weights in FP32 for updates, while using FP16/BF16 for the forward and backward passes.
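As a minimal sketch (not this wiki's reference code), here is what that looks like with PyTorch's automatic mixed precision; the model, data, and hyperparameters are placeholders, and a CUDA GPU is assumed:

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(1024, 1024).to(device)           # placeholder model
optimizer = torch.optim.AdamW(model.parameters())  # optimizer state and master weights stay FP32
scaler = torch.cuda.amp.GradScaler()               # loss scaling, needed for FP16 (not for BF16)

for step in range(100):
    x = torch.randn(32, 1024, device=device)       # placeholder batch
    target = torch.randn(32, 1024, device=device)

    # Forward and backward run in 16-bit; autocast picks per-op precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)

    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()   # scale the loss so small gradients don't underflow in FP16
    scaler.step(optimizer)          # unscale gradients, then update the FP32 master weights
    scaler.update()                 # adjust the loss scale for the next step
```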

BF16 vs. FP16

FP16 (IEEE half-precision) has 5 exponent bits and 10 mantissa bits. BF16 (Brain Float 16) has 8 exponent bits and 7 mantissa bits. BF16's wider exponent range means it can represent the same range of values as FP32 (avoiding overflow/underflow), while FP16's narrower range requires loss scaling to prevent gradients from underflowing to zero. For training, BF16 is simpler and more stable. For inference, FP16 sometimes offers slightly better precision for the same memory cost.
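A small illustration of that range difference, assuming PyTorch is available; the values are chosen only to trigger underflow and overflow in FP16:

```python
import torch

small = torch.tensor(1e-8)               # a tiny but meaningful gradient-sized value
print(small.to(torch.float16))           # 0.0      – underflows: below FP16's smallest subnormal
print(small.to(torch.bfloat16))          # ~1.0e-08 – BF16's FP32-sized exponent range keeps it

big = torch.tensor(70000.0)              # beyond FP16's maximum (~65504)
print(big.to(torch.float16))             # inf      – overflows in FP16
print(big.to(torch.bfloat16))            # 70144.0  – in range, but coarser due to the 7-bit mantissa
```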

FP8 and Beyond

The latest GPUs (NVIDIA H100, H200) support FP8 (8-bit floating point) for even faster computation. FP8 halves memory and doubles throughput compared to FP16, but requires careful handling to avoid quality degradation. Current practice: train in BF16, serve in FP16 or FP8, and quantize to INT4/INT8 for edge deployment. Each step down in precision trades a tiny amount of quality for significant speed and memory gains.
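As an illustration of that final step-down, here is a toy sketch of symmetric per-tensor INT8 quantization (a simplified assumption for demonstration, not a production recipe):

```python
import torch

w = torch.randn(4, 4)                        # placeholder FP32 weight tensor
scale = w.abs().max() / 127.0                # map the largest magnitude onto the INT8 range
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_dequant = w_int8.float() * scale           # dequantize to measure the rounding error

print("max abs error:", (w - w_dequant).abs().max().item())
```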
