
llama.cpp

An open-source C/C++ library for running LLM inference on consumer hardware, created by Georgi Gerganov. llama.cpp performs quantized inference without requiring CUDA, PyTorch, or Python: it runs on CPUs, Apple Silicon, and consumer GPUs. It was the first tool to make running large language models locally accessible to ordinary developers and enthusiasts.

Why It Matters

llama.cpp started the local AI revolution. Before it, running a language model required expensive NVIDIA GPUs and complex Python setups. llama.cpp showed that quantized models can run with acceptable quality even on a MacBook or a Raspberry Pi. It spawned an entire ecosystem (Ollama, LM Studio, kobold.cpp) and made "self-hosted AI" a real option.

Deep Dive

Gerganov released llama.cpp in March 2023, days after Meta released LLaMA. The initial version could run LLaMA-7B on a MacBook using 4-bit quantization — something previously considered impractical. The project grew rapidly, adding support for dozens of architectures (Mistral, Qwen, Phi, Gemma, Command-R), multiple quantization methods (GGML, then GGUF), and hardware acceleration for Metal (Apple), Vulkan (cross-platform GPU), and CUDA (NVIDIA).
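The core idea behind 4-bit quantization is simple: store each block of weights as small integers plus one shared scale factor, trading a little precision for a 4x (or better) memory reduction versus fp16. The sketch below illustrates that idea in Python; the block size, rounding, and helper names are illustrative and do not reproduce the actual GGML/GGUF bit layouts.

```python
# Toy sketch of block-wise 4-bit quantization, in the spirit of
# llama.cpp's Q4 formats (NOT the actual GGUF bit layout).
def quantize_q4(block):
    """Map a block of floats to 4-bit signed integers plus one fp scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 7.0  # 4-bit signed values span roughly [-8, 7]
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize_q4(scale, q):
    """Recover approximate float weights from the quantized block."""
    return [scale * v for v in q]

weights = [0.12, -0.5, 0.9, -0.33]
scale, q = quantize_q4(weights)
restored = dequantize_q4(scale, q)
# Each restored weight lands within one quantization step of the original.
```

Only the integers and one scale per block need to be stored, which is why a 7B-parameter model shrinks to around 4 GB and fits in a laptop's RAM.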

Why C++ Matters

The choice of C/C++ was deliberate: no Python runtime, no PyTorch dependency, minimal system requirements. This enables deployment on embedded systems, mobile devices, and servers without GPU infrastructure. The binary is self-contained — download the executable, download a GGUF model file, and you're running. This simplicity is what enabled the local AI ecosystem to grow so quickly.

Server Mode

llama.cpp includes a server mode that exposes an OpenAI-compatible API, making it a drop-in replacement for cloud APIs in development. Many developers use llama.cpp server locally for development and testing, switching to cloud APIs only for production. This keeps development costs near zero and avoids sending sensitive data to external services during development.
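Because the server speaks the OpenAI chat-completions wire format, any standard HTTP client works against it. A minimal stdlib-only sketch, assuming a local server started along the lines of `llama-server -m model.gguf --port 8080` (8080 is the server's default port; the `model` field is a placeholder since llama.cpp serves whichever model it loaded):

```python
import json
import urllib.request

def build_chat_request(prompt, url="http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat-completions request for a local llama.cpp server."""
    payload = {
        "model": "local",  # placeholder; the server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With a server running, send it and read the reply:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Switching to a cloud provider later is then just a change of URL and API key, which is what makes the local server a practical drop-in during development.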
