Infrastructure

GGUF

GGML Unified Format
The standard file format for running quantized language models locally with llama.cpp, Ollama, and other local inference tools. A GGUF file contains the model weights in quantized form (precision reduced from 16-bit to 4-bit or 8-bit), plus metadata such as the vocabulary, architecture details, and quantization parameters. Everything needed to load and run the model lives in a single file.
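The single-file layout starts with a small fixed header. A minimal sketch of reading it, assuming the current GGUF layout (magic bytes "GGUF", a uint32 version, then uint64 tensor and metadata key-value counts, all little-endian); the function name and return shape here are illustrative, not part of any library:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the start of a file.

    Layout (little-endian): 4 magic bytes "GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key-value count.
    """
    if data[0:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

In practice you would pass the first 24 bytes of the file (e.g. `read_gguf_header(open(path, "rb").read(24))`); the metadata key-value pairs that follow the header carry the vocabulary and architecture details mentioned above.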

Why It Matters

GGUF is the format that made local AI practical. Before it, running a model locally meant a complicated setup of PyTorch, CUDA, and GPU-specific memory requirements. GGUF packs everything into one file that llama.cpp or Ollama can load directly, on a CPU, on Apple Silicon, on a gaming GPU, anywhere. If you see a model on Hugging Face with a filename like "Q4_K_M.gguf", that is a model you can run locally.

Deep Dive

GGUF succeeded GGML (the original format), adding a more extensible metadata system and support for new quantization types. A typical model release includes multiple GGUF variants at different quantization levels: Q2_K (smallest, lowest quality), Q4_K_M (popular sweet spot), Q5_K_M (better quality, larger), Q6_K, Q8_0 (near-original quality, largest). The naming convention tells you the bit-width and quantization method.
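The practical difference between those levels is file size versus quality. A rough size estimate is simply parameters × bits per weight; the effective bits-per-weight figures below are approximations (k-quants store scaling factors alongside the weights, so Q4_K_M averages closer to ~4.8 bits than exactly 4), and the helper function is illustrative:

```python
# Approximate effective bits per weight for common GGUF quant levels.
# Values are rough community estimates, not exact per the spec.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q2_K": 2.6,
}

def estimate_gb(n_params: float, quant: str) -> float:
    """Rough GGUF file size in GB for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Example: a 7B model at each level.
for q, _ in BITS_PER_WEIGHT.items():
    print(f"7B at {q}: ~{estimate_gb(7e9, q):.1f} GB")
```

This is why a 7B model that needs ~14 GB at FP16 fits in roughly 4 GB as Q4_K_M.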

Quantization Variants

The "K" in Q4_K_M refers to k-quant methods that use different bit-widths for different layers based on their sensitivity — attention layers might get higher precision than feed-forward layers. The "M" means "medium" (between "S" for small/aggressive and "L" for large/conservative). Q4_K_M typically preserves 95%+ of the original model quality while reducing file size by 4x compared to FP16. For most users, Q4_K_M or Q5_K_M is the right choice.
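The mixed-precision idea behind k-quants can be illustrated with a toy calculation. The tensor names, parameter fractions, and bit assignments below are made up for illustration; real k-quant layouts are defined per tensor type inside llama.cpp:

```python
# Hypothetical per-tensor bit assignment in the spirit of k-quants:
# sensitive attention tensors get more bits than bulk feed-forward
# weights, so the average lands between the two extremes.
# (fraction of total parameters, bits assigned) -- illustrative only.
tensors = [
    ("attn_q",   0.15, 6),
    ("attn_k",   0.15, 6),
    ("ffn_up",   0.35, 4),
    ("ffn_down", 0.35, 4),
]

avg_bits = sum(frac * bits for _, frac, bits in tensors)
compression_vs_fp16 = 16 / avg_bits
print(f"average: {avg_bits} bits/weight, {compression_vs_fp16:.1f}x smaller than FP16")
```

Even though no tensor uses more than 6 bits, the weighted average (4.6 bits here) is what determines the file size, which is why a "4-bit" k-quant file is slightly larger than a naive 4-bit calculation suggests.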

The Ecosystem

GGUF has become the lingua franca of local AI. Community members quantize new models to GGUF within hours of release and upload them to Hugging Face. Tools like llama.cpp, Ollama, LM Studio, GPT4All, and kobold.cpp all support GGUF natively. This ecosystem is why you can download a 70B model at 4-bit quantization (about 40 GB) and run it on a MacBook Pro with 64 GB of RAM, going from finished download to first response in under a minute.
