
Quantization

Also known as: GGUF, GPTQ, AWQ

Reducing a model's numerical precision to make it smaller and faster. A model trained in 32-bit floating point can be quantized to 8-bit, 4-bit, or even lower, shrinking it 4-8x with surprisingly little quality loss. GGUF is the popular format for local inference via llama.cpp.

Why It Matters

Quantization is what makes a 14-billion-parameter model run on a single GPU, or even a laptop. Without it, open-weights models would be out of reach for most people. The Q4_K_M and Q5_K_M variants hit the sweet spot between size and quality.

Deep Dive

To understand quantization, you need to understand what it's compressing. A neural network's "knowledge" is stored as billions of numerical parameters (weights), each one a floating-point number. During training, these are typically stored in FP32 (32-bit floating point) or BF16 (bfloat16, 16-bit). A 7-billion-parameter model in BF16 takes 7B × 2 bytes = 14GB of memory. Quantization reduces the precision of each weight — representing it with fewer bits. At INT8 (8-bit integer), that same model shrinks to ~7GB. At INT4 (4-bit), it's ~3.5GB. The key insight is that neural network weights are surprisingly redundant — you don't actually need 16 bits of precision to represent them usefully. Most weights cluster around zero and can be approximated with much coarser representations.
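The core mechanic can be shown in a few lines. Below is a minimal sketch of symmetric "absmax" quantization to INT8 (one shared scale per tensor); real methods like GPTQ or the GGUF K-quants are far more sophisticated, using per-block scales and calibration, but the idea of trading bits for a bounded rounding error is the same:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric absmax quantization: map floats to int8 with one scale."""
    scale = float(np.abs(weights).max()) / 127.0  # largest magnitude maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)  # a toy weight row
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, "vs", w.nbytes)  # 4096 vs 16384 bytes: 4x smaller than FP32
# Rounding to the nearest code bounds the per-weight error by scale / 2.
print(float(np.abs(w - w_hat).max()))
```

Because most weights cluster near zero, the rounding error is small relative to the signal, which is why even this naive scheme loses little quality at 8 bits.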

Formats and Methods

The main quantization formats you'll encounter each take different technical approaches. GPTQ (a post-training quantization method for generative pre-trained transformers) was one of the first practical methods — it analyzes how weights interact during actual inference on calibration data and quantizes them in a way that minimizes error propagation. AWQ (Activation-aware Weight Quantization) improved on this by focusing on the small percentage of weights that matter most for model quality and protecting them with higher precision. GGUF is the format used by llama.cpp and is designed for flexible mixed-precision quantization on CPUs and GPUs alike. The naming convention in GGUF files tells you what you're getting: Q4_K_M means 4-bit quantization using the K-quant method at medium quality. Q5_K_M is 5-bit. Q2_K is aggressive 2-bit (noticeable quality loss). Q8_0 is 8-bit (nearly lossless).
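These names translate directly into file sizes. A rough sketch, using approximate effective bits-per-weight figures (K-quants store block scales alongside the weights, so the true rate is a bit above the nominal bit width; the exact values below are ballpark assumptions, not official numbers):

```python
# Approximate effective bits-per-weight for common quant types.
# K-quants carry per-block scales, so e.g. "4-bit" lands near 4.8 bpw.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q2_K": 2.6,
}

def model_size_gb(n_params: float, quant: str) -> float:
    """Estimated weight storage in GB for a model with n_params parameters."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B @ {quant}: {model_size_gb(7e9, quant):.1f} GB")
```

Running this for a 7B model reproduces the arithmetic from the Deep Dive: 14 GB at BF16, roughly 7 GB at 8-bit, and under 5 GB at Q4_K_M.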

Quality vs. Compression

The quality loss from quantization is real but often overstated. Going from BF16 to Q8 (8-bit) is essentially free — benchmarks typically show less than 0.5% degradation on standard evaluations. Q5_K_M still retains most of the model's capability and is often the sweet spot for local inference. Q4_K_M is where you start to notice subtle differences: the model might be slightly less precise with numbers, occasionally lose the thread on very long outputs, or be marginally worse at following complex instructions. Below 4-bit, quality degrades more noticeably — Q2 and Q3 quantizations can make a model noticeably dumber, especially on reasoning tasks. The general rule of thumb is: quantize down to Q4_K_M or Q5_K_M for everyday use, and only go lower if you literally cannot fit the model in your VRAM otherwise.

The Speed Bonus

There's a second dimension to quantization that often gets overlooked: it doesn't just save memory, it makes inference faster. This is counterintuitive if you think of quantized math as "approximate" and therefore slower. But for LLM inference, the bottleneck during token generation is reading model weights from VRAM (memory bandwidth), not computing with them. A Q4 model has half the data to read compared to Q8, so tokens come out roughly twice as fast — assuming your inference engine properly supports quantized compute. This is why llama.cpp running a Q4_K_M model on a desktop GPU can sometimes match the tokens-per-second of a cloud API: the quantized model is genuinely more efficient per unit of hardware. The trade-off is always the same — you're exchanging some quality for speed and accessibility — but for many applications, it's a trade well worth making.
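The bandwidth argument is easy to quantify. During decoding, generating each token requires streaming (roughly) all the weights through the GPU once, so a back-of-the-envelope upper bound on decode speed is memory bandwidth divided by model size. The 1000 GB/s figure below is an assumed ballpark for a high-end desktop GPU:

```python
def decode_tps(bandwidth_gb_per_s: float, model_size_gb: float) -> float:
    """Rough decode-speed ceiling: every token reads all weights once,
    so tokens/sec is bounded by bandwidth / bytes read per token."""
    return bandwidth_gb_per_s / model_size_gb

bw = 1000.0  # assumed GPU memory bandwidth in GB/s
print(f"Q8 (7.4 GB): ~{decode_tps(bw, 7.4):.0f} tok/s ceiling")
print(f"Q4 (4.2 GB): ~{decode_tps(bw, 4.2):.0f} tok/s ceiling")  # fewer bytes per token
```

Real throughput is lower (attention, KV-cache reads, and kernel overhead all cost something), but the ratio holds: halving the bytes per token roughly doubles the ceiling, which is the speed bonus described above.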

Born Quantized

The latest development worth knowing about is quantization-aware training (QAT), where the model is trained with quantization in mind from the start rather than quantized after the fact. Meta released QAT versions of Llama models that outperform post-training-quantized equivalents at the same bit width. This approach produces models that are "born" to work at lower precision rather than having it imposed on them, and it's likely where the field is headed as local inference continues to grow.

Related Concepts

Pruning · RAG