
vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the complex engineering of GPU memory management, request scheduling, and KV cache optimization, providing an OpenAI-compatible API that makes it easy to self-host open models (Llama, Mistral, Qwen) in production.

Why it matters

vLLM is the most widely used open-source LLM serving solution. If you're self-hosting an open model, you're probably using vLLM (or should be). Its PagedAttention innovation increased serving throughput by roughly 2–4x over earlier state-of-the-art serving systems and up to 24x over naive Hugging Face Transformers serving. It's the infrastructure layer that makes open models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving: instead of reserving one contiguous slab of GPU memory per request, the KV cache is stored in small fixed-size blocks, analogous to pages in a virtual memory system, which nearly eliminates memory fragmentation and lets more requests fit on the same hardware. Beyond PagedAttention, vLLM implements: continuous batching (adding new requests to running batches without waiting), prefix caching (sharing KV cache for common prompt prefixes), tensor parallelism (splitting models across multiple GPUs), and speculative decoding (using a draft model to speed up generation). These optimizations compose, so their gains compound.
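To make the block-table idea concrete, here is a minimal Python sketch of PagedAttention-style KV cache bookkeeping. This is an illustration of the concept, not vLLM's actual implementation; the class names, the allocator, and the 16-token block size (vLLM's default) are assumptions for the example.

```python
BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default)

class BlockAllocator:
    """Pool of fixed-size physical KV cache blocks."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            # In vLLM, an exhausted pool triggers preemption/swapping.
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block):
        self.free_blocks.append(block)

class Sequence:
    """One request's KV cache, tracked via a block table
    (logical block index -> physical block id)."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one fills up,
        # so memory grows on demand instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self):
        # Finished sequences return their blocks for immediate reuse.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table = []

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):            # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))    # 3 blocks (ceil(40/16)), not a 40-token slab
seq.release()
print(len(allocator.free_blocks))  # 64: all blocks returned to the pool
```

The key point is that a sequence only ever holds ceil(num_tokens / BLOCK_SIZE) blocks, and freed blocks are immediately reusable by other requests, which is what eliminates the fragmentation of contiguous per-request allocations.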

Usage

Deploying a model with vLLM is straightforward: vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible server on 4 GPUs. Applications connect using any OpenAI SDK by changing the base URL. This drop-in compatibility means you can prototype with OpenAI's API and switch to self-hosted vLLM without changing application code — just change the endpoint.
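A minimal client sketch using the official OpenAI Python SDK against a local vLLM server. The base URL and port assume vLLM's default local deployment; the placeholder API key and the served model name are assumptions that must match your own server configuration.

```python
from openai import OpenAI

# Point the standard OpenAI client at the vLLM server instead of api.openai.com.
# vLLM listens on port 8000 by default and ignores the API key unless the
# server was started with --api-key.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B",  # must match the model the server is serving
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Because only the client construction differs, the same application code runs unchanged against OpenAI's hosted API or the self-hosted server.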

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.
