To understand quantization, you need to understand what it's compressing. A neural network's "knowledge" is stored as billions of numerical parameters (weights), each one a floating-point number. During training, these are typically stored in FP32 (32-bit floating point) or BF16 (bfloat16, 16-bit). A 7-billion-parameter model in BF16 takes 7B × 2 bytes = 14GB of memory. Quantization reduces the precision of each weight — representing it with fewer bits. At INT8 (8-bit integer), that same model shrinks to ~7GB. At INT4 (4-bit), it's ~3.5GB. The key insight is that neural network weights are surprisingly redundant — you don't actually need 16 bits of precision to represent them usefully. Most weights cluster around zero and can be approximated with much coarser representations.
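The core mechanic can be sketched in a few lines. This is a minimal illustration of symmetric INT8 quantization on a synthetic weight tensor (the tensor and its statistics are made up for the example; real schemes quantize per-group rather than per-tensor):

```python
import numpy as np

# Hypothetical weight tensor: values cluster near zero, as is
# typical for trained networks.
rng = np.random.default_rng(0)
weights = rng.normal(loc=0.0, scale=0.02, size=4096).astype(np.float32)

# Symmetric INT8: map [-max|w|, +max|w|] onto the integer range [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to measure the reconstruction error.
recovered = q.astype(np.float32) * scale
max_err = np.abs(weights - recovered).max()
print(f"scale={scale:.6f}, max reconstruction error={max_err:.6f}")

# Storage drops from 4 bytes/weight (FP32) to 1 byte/weight (INT8),
# plus one FP32 scale per tensor (or per small group, in practice).
```

Because most weights sit near zero, the worst-case error is bounded by half a quantization step, which is why coarse representations lose so little.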
The main quantization formats you'll encounter each take different technical approaches. GPTQ (post-training quantization for GPT-style transformers, run on GPUs) was one of the first practical methods: it analyzes how weights interact during inference on calibration data and quantizes them in a way that minimizes error propagation. AWQ (Activation-aware Weight Quantization) improved on this by using activation statistics to identify the small fraction of weights that matter most for model quality and protecting them with higher precision. GGUF is the format used by llama.cpp and is designed for flexible mixed-precision quantization on CPUs and GPUs alike. The naming convention in GGUF files tells you what you're getting: Q4_K_M means 4-bit quantization using the K-quant method at medium quality, Q5_K_M is the 5-bit equivalent, Q2_K is aggressive 2-bit (noticeable quality loss), and Q8_0 is 8-bit (nearly lossless).
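These names translate directly into file sizes. Here is a rough size calculator; the bits-per-weight figures are approximations (K-quants store per-block scales and minimums, so the effective rate sits slightly above the nominal bit width), not exact values from any spec:

```python
# Approximate effective bits per weight for common GGUF quant types.
# K-quant figures are estimates; block-scale overhead pushes them
# above the nominal bit width.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,    # 8 bits per weight + one FP16 scale per block of 32
    "Q5_K_M": 5.7,  # approximate
    "Q4_K_M": 4.8,  # approximate
    "Q2_K": 2.6,    # approximate
}

def model_size_gb(n_params: float, quant: str) -> float:
    """Estimated file size in GB for n_params weights at a given quant type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("BF16", "Q8_0", "Q5_K_M", "Q4_K_M", "Q2_K"):
    print(f"7B model at {quant}: ~{model_size_gb(7e9, quant):.1f} GB")
```

Note that a "4-bit" Q4_K_M file lands a bit above the naive 3.5GB figure for exactly this overhead reason.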
The quality loss from quantization is real but often overstated. Going from BF16 to Q8 (8-bit) is essentially free: benchmarks typically show less than 0.5% degradation on standard evaluations. Q5_K_M retains most of the model's capability and is often the sweet spot for local inference. Q4_K_M is where you start to notice subtle differences: the model might be slightly less precise with numbers, occasionally lose the thread on very long outputs, or be marginally worse at following complex instructions. Below 4 bits, quality degrades sharply, and Q2 and Q3 quantizations can make a model noticeably dumber, especially on reasoning tasks. The rule of thumb: stick to Q4_K_M or Q5_K_M for everyday use, and only go lower if you cannot otherwise fit the model in your VRAM.
There's a second dimension to quantization that often gets overlooked: it doesn't just save memory, it makes inference faster. This is counterintuitive if you think of quantized math as "approximate" and therefore slower. But for LLM inference, the bottleneck during token generation is reading model weights from VRAM (memory bandwidth), not computing with them. A Q4 model has half the data to read compared to Q8, so tokens come out roughly twice as fast — assuming your inference engine properly supports quantized compute. This is why llama.cpp running a Q4_K_M model on a desktop GPU can sometimes match the tokens-per-second of a cloud API: the quantized model is genuinely more efficient per unit of hardware. The trade-off is always the same — you're exchanging some quality for speed and accessibility — but for many applications, it's a trade well worth making.
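The bandwidth argument reduces to one division. The sketch below estimates the tokens-per-second ceiling under the assumption that each generated token requires reading every weight from VRAM once; the bandwidth figure is a hypothetical round number, and real engines add KV-cache traffic and compute time, so treat this strictly as an upper bound:

```python
# Back-of-the-envelope ceiling for single-stream token generation:
# throughput is bounded by memory_bandwidth / model_bytes, because
# every weight must be read once per generated token.

def max_tokens_per_sec(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec for a bandwidth-bound decode loop."""
    return bandwidth_gb_s * 1e9 / model_bytes

bw = 1000.0            # hypothetical high-end desktop GPU, ~1000 GB/s
q8_bytes = 7e9 * 1.0   # ~7 GB: 7B params at 8 bits each
q4_bytes = 7e9 * 0.5   # ~3.5 GB: 7B params at 4 bits each

print(f"Q8 ceiling: ~{max_tokens_per_sec(q8_bytes, bw):.0f} tok/s")
print(f"Q4 ceiling: ~{max_tokens_per_sec(q4_bytes, bw):.0f} tok/s")
```

Halving the bytes doubles the ceiling, which is the whole "quantization is also a speedup" argument in one formula.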
The latest development worth knowing about is quantization-aware training (QAT), where the model is trained with quantization in mind from the start rather than quantized after the fact. Meta released QAT versions of Llama models that outperform post-training-quantized equivalents at the same bit width. This approach produces models that are "born" to work at lower precision rather than having it imposed on them, and it's likely where the field is headed as local inference continues to grow.