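vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving: the KV cache is stored in fixed-size blocks that need not be contiguous in GPU memory, much like pages in an operating system's virtual memory, which reduces fragmentation and lets more requests fit in a batch. Beyond PagedAttention, vLLM implements continuous batching (admitting new requests into a running batch as soon as slots free up, rather than waiting for the whole batch to finish), prefix caching (sharing KV cache across requests with a common prompt prefix), tensor parallelism (splitting a model's weights across multiple GPUs), and speculative decoding (using a small draft model to propose tokens that the main model then verifies). These optimizations compose, so their gains can stack, though the actual speedup depends on the workload.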
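To make the benefit of continuous batching concrete, here is a toy simulation. It is not vLLM's scheduler: it assumes each request needs a fixed, known number of decode steps, that one step produces one token for every running request, and that the batch has a fixed number of slots. The function names are hypothetical and exist only for this illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # The whole batch waits for its longest request.
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)   # requests waiting to be admitted
    running = []             # remaining decode steps per running request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step: every running request emits one token.
        steps += 1
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Output lengths (in tokens) of four requests, served with two batch slots.
lengths = [2, 8, 3, 4]
print(static_batching_steps(lengths, batch_size=2))      # 12
print(continuous_batching_steps(lengths, batch_size=2))  # 9
```

In this example the short request finishes after two steps and its slot is reused immediately, so the long request no longer holds the whole batch hostage. Real schedulers also have to account for prefill cost and KV cache memory, which this sketch ignores.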
Deploying a model with vLLM is straightforward: the command vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible HTTP server with the model sharded across 4 GPUs. Applications connect using any OpenAI SDK by pointing its base URL at the vLLM server. This drop-in compatibility means you can prototype against OpenAI's API and later switch to self-hosted vLLM by changing only the endpoint and the model name; the rest of the application code stays the same.
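As a rough illustration of what "OpenAI-compatible" means on the wire, the sketch below builds a completion request using only the Python standard library. It assumes the server is reachable at http://localhost:8000/v1 (port 8000 is vLLM's default, but your deployment may differ) and that the model name matches the one passed to vllm serve. The helper names build_completion_request and send are hypothetical.

```python
import json
import urllib.request

# Assumptions for illustration: a local server on port 8000, serving the
# model name used in the command above. Adjust both for a real deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3-70B"


def build_completion_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) an OpenAI-style text completion request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 64}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Switching backends only means passing a different base_url and model.
request = build_completion_request("PagedAttention is")
print(request.full_url)
# With a server running, query it with: print(send(request))
```

With the official openai Python package, the same switch is made by passing base_url to the client constructor; vLLM does not check the API key unless the server is configured to require one.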
TGI (Text Generation Inference, from Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels to extract maximum performance from each GPU but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice because of its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.