Infrastructure

Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software that runs trained AI models in production: handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks such as vLLM, TGI (Text Generation Inference), and TensorRT-LLM handle the complex engineering needed to make LLM inference fast and cost-effective at scale.

Why It Matters

The gap between "I have a model" and "I can serve 10,000 concurrent users" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV-cache optimization, and continuous batching, all problems that are hard to solve from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.
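Continuous batching is the key scheduling idea mentioned above: instead of waiting for an entire batch to finish (static batching), new requests join the running batch as soon as a slot frees up. The following is a minimal toy simulation of that policy; the function name and request format are illustrative, not any framework's API.

```python
# Toy continuous-batching scheduler: sequences finish at different
# lengths, and waiting requests are admitted each decode step as soon
# as batch slots free up. Purely illustrative (no real model involved).
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Each request is (id, tokens_to_generate). Returns completion order."""
    queue = deque(requests)
    running = {}   # request id -> tokens still to generate
    finished = []
    while queue or running:
        # Admit waiting requests into any free batch slots.
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:       # sequence done: slot frees up now
                del running[rid]
                finished.append(rid)
    return finished

reqs = [("a", 3), ("b", 1), ("c", 2), ("d", 1), ("e", 2)]
print(continuous_batching(reqs))  # -> ['b', 'd', 'c', 'a', 'e']
```

Note that "e" starts as soon as "b" and "d" free their slots, one full batch-drain earlier than static batching would allow; that head start is where the throughput gain comes from.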

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention — managing KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
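The PagedAttention idea can be sketched in a few lines: KV cache is split into fixed-size blocks, each sequence maps token positions to physical blocks through a per-sequence block table, and freed blocks go straight back to a shared pool. This is a simplified illustration of the paging concept, not vLLM's actual implementation; all names and the block size are assumptions.

```python
# Toy sketch of PagedAttention-style KV-cache paging: fixed-size blocks
# plus a per-sequence block table, so freed blocks are immediately
# reusable by any sequence (no contiguous-allocation fragmentation).
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Map token position `pos` of a sequence to a physical block."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # current block full: grab a new one
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=8)
for pos in range(40):                        # 40 tokens need ceil(40/16) = 3 blocks
    alloc.append_token("seq-1", pos)
print(len(alloc.tables["seq-1"]), len(alloc.free))  # 3 blocks used, 5 free
alloc.release("seq-1")
print(len(alloc.free))                       # 8: blocks reusable immediately
```

Because blocks need not be contiguous, memory is only wasted inside the last partially filled block per sequence, which is what eliminates most of the fragmentation that naive contiguous KV-cache allocation suffers from.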

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
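The cost-optimizing router at the end of the stack can be as simple as a heuristic over the incoming prompt. A minimal sketch, assuming a length threshold and keyword list as the routing policy and hypothetical model names; real routers often use a learned classifier instead.

```python
# Hypothetical prompt router: short, simple requests go to a small cheap
# model; long or complex ones go to a large model. The heuristic,
# threshold, and model names are illustrative assumptions.
def route(prompt, complexity_keywords=("prove", "analyze", "refactor")):
    hard = len(prompt.split()) > 200 or any(
        kw in prompt.lower() for kw in complexity_keywords
    )
    return "large-model-70b" if hard else "small-model-8b"

print(route("What is the capital of France?"))      # small-model-8b
print(route("Please analyze this stack trace ..."))  # large-model-70b
```

In production the router sits in front of the model servers (often inside the reverse proxy layer) so that each request is dispatched before it ever reaches a GPU.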

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
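The break-even reasoning above is simple arithmetic: round-the-clock GPU rental plus operational overhead versus the monthly API bill. A back-of-envelope sketch, using made-up numbers (the hourly rate and overhead figure are illustrative assumptions, not real prices):

```python
# Back-of-envelope break-even check for self-hosting vs. API.
# All dollar figures are hypothetical placeholders.
def self_host_monthly_cost(gpus, hourly_rate, ops_overhead=2000.0):
    """GPU rental around the clock (24 h x 30 d) plus flat ops overhead."""
    return gpus * hourly_rate * 24 * 30 + ops_overhead

def self_hosting_wins(api_monthly, gpus, hourly_rate):
    return api_monthly >= self_host_monthly_cost(gpus, hourly_rate)

# 2 GPUs at an assumed $2.50/hr: $3,600 rental + $2,000 ops = $5,600/mo
print(self_host_monthly_cost(2, 2.50))    # 5600.0
print(self_hosting_wins(1_000, 2, 2.50))  # False: API is cheaper at low scale
print(self_hosting_wins(10_000, 2, 2.50)) # True: self-hosting can win
```

The crossover point shifts with utilization: idle GPUs still cost rent, while API pricing is purely per-token, which is why the low-volume case so rarely favors self-hosting.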
