
llama.cpp

An open-source C/C++ library for running LLM inference on consumer hardware, created by Georgi Gerganov. llama.cpp performs quantized inference and needs no CUDA, PyTorch, or Python: it runs on CPUs, Apple Silicon, and consumer GPUs. It was the first tool that let everyday developers and hobbyists run large language models locally.

Why It Matters

llama.cpp kicked off the local AI revolution. Before it, running a language model meant expensive NVIDIA GPUs and a complicated Python setup. llama.cpp demonstrated that quantized models could run at acceptable quality on a MacBook, or even a Raspberry Pi. It spawned an entire ecosystem (Ollama, LM Studio, kobold.cpp) and turned self-hosted AI into a genuine option.

Deep Dive

Gerganov released llama.cpp in March 2023, roughly two weeks after Meta released LLaMA. The initial version could run LLaMA-7B on a MacBook using 4-bit quantization, something previously considered impractical. The project grew rapidly, adding support for dozens of architectures (Mistral, Qwen, Phi, Gemma, Command-R), a range of quantization schemes packaged first in the GGML file format and later in GGUF, and hardware acceleration for Metal (Apple), Vulkan (cross-platform GPU), and CUDA (NVIDIA).
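To make the 4-bit claim concrete, here is a back-of-the-envelope sketch of weight memory at different precisions. The bits-per-weight figures are nominal assumptions: real quantization schemes such as Q4_0 store per-block scale factors, so effective sizes run slightly higher, and GGUF files also carry metadata.

```python
# Approximate weight memory for LLaMA-7B at different precisions.
# Nominal figures only: real quantized formats add per-block scales,
# so effective bits per weight are somewhat higher than shown here.

PARAMS = 7e9  # LLaMA-7B has roughly 7 billion parameters

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.1f} GiB")

# fp16 : ~13.0 GiB  (tight on a 16 GB laptop once the OS is loaded)
# int8 : ~6.5 GiB
# 4-bit: ~3.3 GiB   (fits comfortably in a MacBook's unified memory)
```

The last line is why a 2023-era MacBook could suddenly host a model that previously demanded a datacenter GPU.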

Why C++ Matters

The choice of C/C++ was deliberate: no Python runtime, no PyTorch dependency, minimal system requirements. This enables deployment on embedded systems, mobile devices, and servers without GPU infrastructure. The binary is self-contained — download the executable, download a GGUF model file, and you're running. This simplicity is what enabled the local AI ecosystem to grow so quickly.
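As a small illustration of how self-contained a GGUF file is, the sketch below reads just its fixed header, assuming the layout from the published GGUF specification (a 4-byte magic, then a little-endian version, tensor count, and metadata key/value count); the file name is hypothetical:

```python
import struct

def read_gguf_header(path: str):
    """Read the fixed GGUF header fields (per the GGUF spec)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))         # format version
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))  # counts
    return version, n_tensors, n_kv

# Hypothetical file name; any GGUF model file works the same way.
print(read_gguf_header("llama-2-7b.Q4_K_M.gguf"))
```

Everything the runtime needs, including weights, tokenizer, and hyperparameters, travels in that one file, which is why deployment reduces to copying two artifacts.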

Server Mode

llama.cpp includes a server mode that exposes an OpenAI-compatible API, making it a drop-in replacement for cloud APIs in development. Many developers use llama.cpp server locally for development and testing, switching to cloud APIs only for production. This keeps development costs near zero and avoids sending sensitive data to external services during development.
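A minimal client sketch, assuming a local `llama-server` on its default port 8080; the endpoint path mirrors OpenAI's, and the model field is a placeholder since the server answers with whatever model it was started with:

```python
import json
import urllib.request

# Query a local llama.cpp server through its OpenAI-compatible endpoint.
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps({
        "model": "local",  # placeholder: the server uses its loaded model
        "messages": [{"role": "user", "content": "Say hello in one line."}],
        "max_tokens": 32,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```

Because the request and response shapes match OpenAI's, switching between this local server and a cloud API is typically just a change of base URL and API key.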
