Infrastructure

Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software that runs trained AI models in production: handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks such as vLLM, TGI (Text Generation Inference), and TensorRT-LLM take care of the complex engineering needed to make LLM inference fast and cost-effective at scale.
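As a concrete sketch of what "serving" looks like in practice, a vLLM deployment can expose an OpenAI-compatible HTTP endpoint. The model name below is a placeholder, and exact CLI flags vary by vLLM version; this requires a GPU and the vllm package installed:

```shell
# Launch a vLLM server exposing an OpenAI-compatible API
# (<your-model-name> is a placeholder for a Hugging Face model id).
python -m vllm.entrypoints.openai.api_server \
    --model <your-model-name> \
    --port 8000

# Query it like any OpenAI-style completions endpoint.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<your-model-name>", "prompt": "Hello", "max_tokens": 32}'
```

TGI offers an analogous setup via its `text-generation-launcher` Docker image, with the same pattern: one long-running server process per model, fronted by HTTP.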

Why It Matters

The gap between "I have a model" and "I can serve 10,000 concurrent users" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching, all hard problems to build from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention — managing KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
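The core idea behind PagedAttention can be illustrated with a toy allocator: the KV cache is split into fixed-size blocks, each sequence holds a table of block ids, and blocks return to a free list when a sequence finishes, so no large contiguous region is ever reserved up front. This is a minimal Python sketch of the bookkeeping only; real vLLM manages GPU memory and custom attention kernels, not Python lists:

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens stored per block
        self.free_blocks = list(range(num_blocks))
        self.tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                       # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:            # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because allocation happens one block at a time, a sequence only ever wastes at most one partially filled block, which is what eliminates the fragmentation that naive contiguous pre-allocation suffers from.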

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
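The routing layer mentioned above can be as simple as a heuristic that inspects the incoming prompt. This is a hypothetical sketch: the model names, the length threshold, and the keyword list are all illustrative assumptions, not part of any real framework:

```python
SMALL_MODEL = "small-7b"     # assumed deployment name for the cheap model
LARGE_MODEL = "large-70b"    # assumed deployment name for the capable model

def route(prompt: str, max_small_tokens: int = 200) -> str:
    """Pick a backend model using a rough token count (words as a proxy)
    and a keyword check for requests that likely need deeper reasoning."""
    approx_tokens = len(prompt.split())
    needs_reasoning = any(
        kw in prompt.lower() for kw in ("prove", "step by step", "analyze")
    )
    if approx_tokens <= max_small_tokens and not needs_reasoning:
        return SMALL_MODEL
    return LARGE_MODEL
```

Production routers are usually more sophisticated (classifier models, per-tenant rules, fallback on overload), but the cost logic is the same: send the bulk of easy traffic to cheap capacity.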

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
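The break-even comparison above is back-of-envelope arithmetic. A minimal sketch, with all prices as illustrative placeholders rather than current market rates:

```python
def monthly_api_cost(tokens_per_month: float, price_per_mtok: float) -> float:
    """API spend: total tokens times a blended $/1M-token price."""
    return tokens_per_month / 1e6 * price_per_mtok

def monthly_selfhost_cost(gpu_hourly: float, num_gpus: int,
                          ops_overhead: float = 500.0) -> float:
    """GPU rental running 24/7 for a 30-day month, plus a flat
    ops/engineering overhead (assumed figure)."""
    return gpu_hourly * num_gpus * 24 * 30 + ops_overhead

# Example: 2B tokens/month at $5 per 1M tokens vs. four $2/hr GPUs.
api = monthly_api_cost(2e9, price_per_mtok=5.0)
hosted = monthly_selfhost_cost(gpu_hourly=2.0, num_gpus=4)
print(f"API: ${api:,.0f}/mo  self-host: ${hosted:,.0f}/mo")
```

Under these assumed numbers the API costs $10,000/month against roughly $6,260/month self-hosted, consistent with the rule of thumb that self-hosting starts to win somewhere in the five-figure monthly range; at low volume the always-on GPU cost dominates and the API wins.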
