
Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software that runs trained AI models in production, handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks like vLLM, TGI (Text Generation Inference), and TensorRT-LLM handle the complex engineering of making LLM inference fast and cost-effective at scale.

Why it matters

The gap between "I have a model" and "I can serve 10,000 users simultaneously" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching — problems that are hard to solve from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.
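Continuous batching, mentioned above, is the scheduling trick that distinguishes modern LLM servers from static batching: finished sequences leave the batch between decode steps, and waiting requests immediately take their slots. A minimal sketch of the idea (toy simulation, not any framework's actual scheduler; request tuples and step counting are illustrative):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. Each request is (id, n_tokens).
    Finished sequences leave the batch mid-flight and waiting
    requests are pulled in, so the batch stays full."""
    waiting = deque(requests)
    running = {}          # request id -> tokens still to generate
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]      # frees a slot immediately
        steps += 1
    return steps
```

With static batching, a batch waits for its longest member before the next batch starts; here a short request frees its slot the moment it finishes, which is why the same workload completes in fewer decode steps.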

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention — managing KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
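The core idea behind PagedAttention can be sketched in a few lines: instead of reserving one contiguous KV-cache region per sequence, the cache is carved into fixed-size blocks allocated from a shared pool, and each sequence holds a page table of block ids. The class and method names below are hypothetical, not vLLM's API:

```python
class KVBlockPool:
    """Toy PagedAttention-style allocator: freed blocks return to a
    shared pool and are reusable by any sequence, so memory is never
    stranded by fragmentation."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}              # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        # Allocate a new block only when the sequence crosses a
        # block boundary (pos is the 0-based token index).
        if pos % self.block_size == 0:
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        # A finished sequence returns all its blocks to the pool.
        self.free.extend(self.tables.pop(seq_id, []))
```

The win is utilization: with contiguous allocation a server must reserve worst-case space per sequence, while block-level allocation wastes at most one partially filled block per sequence.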

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
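The model router mentioned above can be as simple as a heuristic on the prompt itself. A minimal sketch, where the model names, length threshold, and keyword markers are all illustrative placeholders:

```python
def route(prompt):
    """Toy model router: send short, simple prompts to a small model
    and long or reasoning-heavy prompts to a large one. Thresholds
    and markers here are hypothetical, not tuned values."""
    hard_markers = ("prove", "derive", "step by step", "analyze")
    text = prompt.lower()
    if len(text.split()) > 200 or any(m in text for m in hard_markers):
        return "large-model"
    return "small-model"
```

Production routers are usually more sophisticated (a small classifier, or a cheap model that escalates when unsure), but the economics are the same: most traffic is simple, so serving it on a smaller model cuts GPU cost substantially.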

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
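The break-even arithmetic behind those thresholds is straightforward. A sketch, assuming illustrative rental prices (real comparisons must also count engineering time, idle capacity, and reserved-instance discounts):

```python
def self_host_saves(api_cost_month, gpu_hourly, gpus):
    """Toy break-even check: compare monthly API spend against the
    cost of renting dedicated GPUs around the clock (30-day month).
    All prices are illustrative placeholders."""
    hosting_cost = gpu_hourly * gpus * 24 * 30
    return api_cost_month > hosting_cost

# At a hypothetical $2/hr per GPU, two GPUs cost ~$2,880/month:
# a $1,000/month API bill doesn't justify self-hosting, a
# $10,000/month bill does.
```

This ignores the fixed cost of building and operating the serving stack, which is why the practical break-even sits well above the raw GPU rental price.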
