vLLM (UC Berkeley) introduced PagedAttention, which manages the KV cache in fixed-size blocks, analogous to virtual memory pages, eliminating fragmentation and achieving 2–4x higher throughput than systems that preallocate contiguous KV-cache memory per request. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
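The allocation idea behind PagedAttention can be sketched in a few lines. This is an illustrative toy, not vLLM's actual implementation: the KV cache is carved into fixed-size blocks, each sequence holds a block table (a list of block IDs), and blocks are claimed only as the sequence grows, so no contiguous region is ever reserved up front.

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class PagedKVAllocator:
    """Toy block allocator illustrating paged KV-cache management."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.block_tables = {}  # seq_id -> list of block IDs (the "page table")
        self.seq_lens = {}      # seq_id -> tokens cached so far

    def append_token(self, seq_id: int) -> None:
        """Cache one more token; claim a new block only at block boundaries."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # sequence is new or its last block is full
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because any free block can serve any sequence, the only waste is the partially filled last block of each sequence, which is what eliminates the fragmentation seen with contiguous per-request reservations.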
A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
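The cost-optimizing router mentioned above can be as simple as a heuristic in front of two backends. The sketch below is hypothetical: the model names and the length threshold are assumptions, and production routers often use a trained classifier or a confidence score from the small model instead of prompt length.

```python
# Hypothetical backend names and threshold -- illustrative only.
SMALL_MODEL = "small-7b"      # cheap model for simple requests
LARGE_MODEL = "large-70b"     # expensive model for complex requests
THRESHOLD_TOKENS = 200        # assumed complexity cutoff

def route(prompt: str) -> str:
    """Pick a backend using a crude complexity proxy: whitespace token count."""
    approx_tokens = len(prompt.split())
    return SMALL_MODEL if approx_tokens < THRESHOLD_TOKENS else LARGE_MODEL
```

Even a crude router like this can cut serving cost substantially when most traffic is simple, since the large model is only engaged for the long tail of complex requests.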
The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below roughly $1,000/month in API costs, self-hosting rarely makes economic sense: a single always-on inference GPU can approach that figure in rental fees alone, before any engineering time. Above roughly $10,000/month, self-hosting often wins because you can optimize batching, quantization, and hardware choice for your specific workload. Privacy requirements (data that cannot leave your infrastructure) often force self-hosting regardless of cost.
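The break-even intuition can be made concrete with back-of-the-envelope arithmetic. Every number below is an illustrative assumption (a blended API price per million tokens, an hourly GPU rental rate, and the throughput a tuned server sustains on one GPU), not a quote from any provider.

```python
# Assumed figures -- adjust to your provider and hardware.
API_PRICE_PER_M_TOKENS = 2.00       # $ per 1M tokens (assumed blended rate)
GPU_RENTAL_PER_HOUR = 2.50          # $ per hour for one inference GPU (assumed)
SELF_HOSTED_TOKENS_PER_SEC = 2_000  # sustained throughput per GPU (assumed)

HOURS_PER_MONTH = 30 * 24

def monthly_api_cost(tokens_per_month: float) -> float:
    """Pay-per-token API bill."""
    return tokens_per_month / 1e6 * API_PRICE_PER_M_TOKENS

def monthly_self_host_cost(tokens_per_month: float) -> float:
    """GPUs needed to sustain the average rate, billed around the clock."""
    avg_tokens_per_sec = tokens_per_month / (HOURS_PER_MONTH * 3600)
    gpus = max(1.0, -(-avg_tokens_per_sec // SELF_HOSTED_TOKENS_PER_SEC))  # ceil
    return gpus * GPU_RENTAL_PER_HOUR * HOURS_PER_MONTH
```

Under these assumptions, 500M tokens/month costs $1,000 via the API but $1,800 for one rented GPU, so the API wins; at 10B tokens/month the API bill is $20,000 while two GPUs cost $3,600, so self-hosting wins. The crossover moves with utilization: the self-hosted GPU bills 24/7 whether or not traffic fills it.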