Infrastructure

Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software that run trained AI models in production: handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks such as vLLM, TGI (Text Generation Inference), and TensorRT-LLM take care of the complex engineering that makes LLM inference fast and cost-effective at scale.

Why It Matters

The gap between “I have a model” and “I can serve 10,000 users simultaneously” is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching: problems that are hard to get right from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention — managing KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
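A minimal sketch of vLLM's offline Python API is shown below, assuming the vllm package is installed and the example model name is accessible; PagedAttention and continuous batching are applied by the engine automatically, so the caller never manages the KV cache or batch composition.

# Minimal vLLM offline-inference sketch (assumes: pip install vllm,
# and access to the example model weights named below).
from vllm import LLM, SamplingParams

# The engine allocates the KV cache in fixed-size pages (PagedAttention)
# and batches requests continuously; no manual batching is needed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model name

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "What does a KV cache store?",
]

# generate() schedules all prompts together and returns RequestOutput
# objects in the same order as the prompts.
for out in llm.generate(prompts, sampling):
    print(out.prompt, "->", out.outputs[0].text.strip())

For serving over HTTP rather than in-process, vLLM also ships an OpenAI-compatible server, which is typically what the stack described below sits in front of.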

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
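To make the routing idea concrete, the sketch below sends short or simple prompts to a small model and everything else to a large one. The endpoint URLs, model names, and length/keyword heuristic are illustrative assumptions, not part of any particular framework; the only requirement is that both backends expose an OpenAI-compatible chat completions API.

# Illustrative cost-aware model router (assumed endpoints, toy heuristic).
import requests

SMALL = {"url": "http://small-llm:8000/v1/chat/completions",  # hypothetical host
         "model": "small-8b"}
LARGE = {"url": "http://large-llm:8000/v1/chat/completions",  # hypothetical host
         "model": "large-70b"}

def route(prompt: str) -> dict:
    # Toy heuristic: long prompts or reasoning-style keywords go to the large model.
    hard = len(prompt) > 500 or any(k in prompt.lower() for k in ("prove", "plan", "analyze"))
    return LARGE if hard else SMALL

def complete(prompt: str) -> str:
    target = route(prompt)
    resp = requests.post(
        target["url"],
        json={"model": target["model"],
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(complete("Summarize what a reverse proxy does."))

In practice the router usually lives behind the load balancer and in front of the request queue, so both model pools can still be scaled independently.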

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
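The break-even point can be estimated with back-of-the-envelope arithmetic. The prices and throughput in the sketch below are assumptions for illustration only, not quoted rates.

# Rough self-host vs. API break-even estimate (all numbers are assumptions).
API_PRICE_PER_M_TOKENS = 0.60        # $ per million output tokens (assumed)
GPU_HOURLY_RATE = 2.50               # $ per GPU-hour rental (assumed)
TOKENS_PER_SECOND_PER_GPU = 1_500    # sustained batched throughput (assumed)

HOURS_PER_MONTH = 730
gpu_monthly_cost = GPU_HOURLY_RATE * HOURS_PER_MONTH            # ~$1,825 per GPU
tokens_per_month_per_gpu = TOKENS_PER_SECOND_PER_GPU * 3600 * HOURS_PER_MONTH

# Monthly token volume at which one self-hosted GPU matches the API bill.
break_even_tokens = gpu_monthly_cost / (API_PRICE_PER_M_TOKENS / 1e6)
print(f"GPU cost/month:      ${gpu_monthly_cost:,.0f}")
print(f"Break-even volume:   {break_even_tokens / 1e9:.1f}B tokens/month")
print(f"One GPU can serve:   {tokens_per_month_per_gpu / 1e9:.1f}B tokens/month")

Under these assumed numbers the crossover sits near a single GPU's monthly rental cost, but real deployments rarely sustain full utilization around the clock, which is why the practical threshold lands well above it.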
