
Quantization

Related formats and methods: GGUF, GPTQ, AWQ
Reducing the numerical precision of a model's weights to make it smaller and faster. A model trained in 32-bit floating point can be quantized to 8-bit, 4-bit, or even lower, shrinking its size by 4-8x with surprisingly small quality loss. GGUF is the most popular file format for local inference via llama.cpp.
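
The core idea fits in a few lines. Below is a minimal sketch of symmetric round-to-nearest int8 quantization in NumPy; the function names are illustrative, and production methods such as GPTQ and AWQ are considerably more sophisticated (calibration data, per-group scales, error compensation):

```python
import numpy as np

# Quantize: map the FP32 values onto signed int8 steps with one scale factor.
def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0           # largest weight maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

# Dequantize: multiply back by the scale to recover approximate FP32 weights.
def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)    # stand-in for one weight row
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(f"max abs error: {np.abs(w - w_hat).max():.5f}")  # small vs. weight magnitudes
```

Each int8 value takes a quarter of the storage of an FP32 value, at the cost of a small, bounded rounding error per weight.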

Why it matters

Quantization is what makes it possible to run a 14B-parameter model on a single consumer GPU or even a laptop. Without it, open-weights models would be out of reach for most people. The Q4_K_M and Q5_K_M variants (llama.cpp's mixed 4- and 5-bit quantization types) hit the sweet spot of size vs. quality.
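
A back-of-the-envelope calculation makes the point. This sketch counts weight memory only, ignoring the KV cache and activations, and the ~4.85 bits per weight used for Q4_K_M is an approximate effective rate, since k-quants mix block scales into the storage cost:

```python
# "GB" here means 1e9 bytes; weights only.
def weight_memory_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("~Q4_K_M", 4.85)]:
    print(f"14B weights at {label:>7}: {weight_memory_gb(14, bits):5.1f} GB")
# FP32 ~56 GB, FP16 ~28 GB, INT8 ~14 GB, Q4_K_M-style ~8.5 GB
```

At roughly 8.5 GB, a 14B model's weights fit comfortably in a 12 GB GPU or a laptop with 16 GB of unified memory, which is exactly the hardware most people actually have.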
