
llama.cpp

An open-source C/C++ library for running LLM inference on consumer hardware, created by Georgi Gerganov. llama.cpp performs quantized inference and needs no CUDA, PyTorch, or Python: it runs on CPUs, Apple Silicon, and consumer GPUs. It was the first tool that let ordinary developers and hobbyists run large language models locally.

Why It Matters

llama.cpp kicked off the local AI revolution. Before it, running a language model meant expensive NVIDIA GPUs and a complicated Python setup. llama.cpp showed that quantized models could run at acceptable quality on a MacBook, or even a Raspberry Pi. It spawned an entire ecosystem (Ollama, LM Studio, kobold.cpp) and made self-hosted AI a genuine option.

Deep Dive

Gerganov released llama.cpp in March 2023, days after Meta released LLaMA. The initial version could run LLaMA-7B on a MacBook using 4-bit quantization, something previously considered impractical. The project grew rapidly, adding support for dozens of architectures (Mistral, Qwen, Phi, Gemma, Command-R), a succession of model file formats (GGML, then its replacement GGUF) carrying many quantization types, and hardware acceleration via Metal (Apple), Vulkan (cross-platform GPU), and CUDA (NVIDIA).
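
The core idea behind that 4-bit result is block-wise quantization: weights are split into small blocks, and each block stores low-bit integers plus one scale factor. Below is a minimal Python sketch of a symmetric 4-bit scheme in the spirit of ggml's Q4_0; the block size of 32 matches Q4_0, but the real kernels pack two 4-bit values per byte, store float16 scales, and round differently, so treat this as an illustration rather than ggml's actual code.

```python
import numpy as np

def quantize_q4(block: np.ndarray):
    """Quantize one block of 32 float weights to 4-bit ints plus a scale.

    Simplified sketch: scale the block so its largest-magnitude weight
    lands at the edge of the 4-bit signed range, then round.
    """
    amax = np.abs(block).max()
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the 4-bit ints."""
    return q.astype(np.float32) * scale

weights = np.random.randn(32).astype(np.float32)
q, scale = quantize_q4(weights)
restored = dequantize_q4(q, scale)
print("max abs error:", np.abs(weights - restored).max())
```

At roughly 4.5 bits per weight (32 four-bit values plus a 16-bit scale per block), a 7B-parameter model shrinks to about 4 GB of weights, which is how it fits in a laptop's RAM.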

Why C++ Matters

The choice of C/C++ was deliberate: no Python runtime, no PyTorch dependency, minimal system requirements. This enables deployment on embedded systems, mobile devices, and servers without GPU infrastructure. The binary is self-contained — download the executable, download a GGUF model file, and you're running. This simplicity is what enabled the local AI ecosystem to grow so quickly.
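
As an illustration of that simplicity, the sketch below reads the fixed header of a GGUF file in Python: a 4-byte magic, a version number, a tensor count, and a metadata key/value count, all little-endian. The file path is a placeholder, and everything past the header (the metadata entries and tensor records) is omitted.

```python
import struct

def read_gguf_header(path: str):
    """Read GGUF's fixed header fields: magic, version, tensor count,
    and metadata key/value count (all little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))
        n_tensors, = struct.unpack("<Q", f.read(8))
        n_kv, = struct.unpack("<Q", f.read(8))
    return version, n_tensors, n_kv

# Placeholder path: point this at any downloaded .gguf model file.
version, n_tensors, n_kv = read_gguf_header("model.gguf")
print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```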

Server Mode

llama.cpp includes a server mode that exposes an OpenAI-compatible API, making it a drop-in replacement for cloud APIs in development. Many developers use llama.cpp server locally for development and testing, switching to cloud APIs only for production. This keeps development costs near zero and avoids sending sensitive data to external services during development.
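
A sketch of that workflow, assuming a llama.cpp server running on its default port 8080: the standard openai Python client is pointed at the local endpoint. The model name and API key below are placeholders, since the local server serves whichever model it loaded and by default requires no key.

```python
from openai import OpenAI  # pip install openai

# Point the standard client at the local llama.cpp server instead of
# the cloud endpoint. Port 8080 is the server's default; the API key
# is a placeholder because the local server does not check it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; the server uses its loaded model
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Because the request shape is identical, switching to a cloud provider for production is a one-line change to base_url and api_key.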
