Infrastructure

Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software that runs trained AI models in production: handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks such as vLLM, TGI (Text Generation Inference), and TensorRT-LLM handle the complex engineering of making LLM inference fast and cost-effective at scale.

Why It Matters

The gap between "I have a model" and "I can serve 10,000 concurrent users" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching, all hard problems to build from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.
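Continuous batching is the main throughput trick: instead of waiting for an entire batch to finish, the scheduler admits a queued request the moment any running sequence completes. A minimal pure-Python sketch of that scheduling loop (heavily simplified; real servers also track KV cache capacity, priorities, and preemption):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching scheduler.

    requests: list of (request_id, tokens_to_generate).
    Returns request ids in completion order. New requests are admitted
    as soon as a slot frees up, instead of waiting for the whole batch
    to drain (static batching).
    """
    waiting = deque(requests)
    running = {}   # request_id -> tokens still to generate
    finished = []
    while waiting or running:
        # Admit queued requests into free batch slots.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
    return finished

# Short requests finish early and immediately free their slot for queued work.
order = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 1)])
```

With static batching, request "e" would wait for the slowest member of the first batch ("b") before even starting; here it is admitted as soon as "c" finishes.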

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention — managing KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
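The core idea of PagedAttention can be illustrated with a toy block allocator: each sequence's KV cache lives in fixed-size blocks indexed by a per-sequence block table, so any freed block can be reused by any sequence and fragmentation disappears. A sketch under that framing (class and method names are illustrative, not vLLM's actual API):

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size blocks plus per-sequence block tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free physical blocks
        self.tables = {}   # seq_id -> list of physical block ids (the block table)
        self.lengths = {}  # seq_id -> number of tokens cached

    def append_token(self, seq_id):
        """Reserve KV cache space for one new token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted; a real server would preempt")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# 5 tokens with block_size=4 need exactly 2 blocks; no per-sequence
# contiguous reservation, so nothing is wasted to fragmentation.
cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("s1")
```

The contrast with naive serving is that naive implementations reserve one contiguous region per sequence sized for the maximum possible length, wasting memory on padding; paging allocates only the blocks actually used.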

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
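The routing idea in the last sentence can be sketched as a tiny policy function: a cheap heuristic picks the backend per request. The prompt-length heuristic and backend names below are placeholder assumptions; production routers often use a learned classifier or explicit task tags instead:

```python
def route(prompt: str, threshold: int = 200) -> str:
    """Send short/simple prompts to a cheap small model, the rest to a large one.

    threshold and backend names are illustrative placeholders, not a
    real framework's API; swap in a classifier for smarter routing.
    """
    return "small-model" if len(prompt) < threshold else "large-model"

backend = route("What is the capital of France?")  # short -> "small-model"
```

Even this crude policy can cut serving cost substantially when most traffic is simple, because the expensive large-model GPUs only see the hard tail of requests.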

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
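As a back-of-the-envelope check on those thresholds, compare monthly API spend against rented GPUs running 24/7 plus the operational overhead of maintaining them. Every number below is an illustrative assumption, not a quoted price:

```python
def self_hosting_breakeven(api_monthly_usd,
                           gpu_hourly_usd=2.0,    # assumed rental price per GPU
                           num_gpus=2,
                           ops_monthly_usd=3000): # assumed eng/ops overhead
    """Return (self_host_cost, cheaper_option) for one month of 24/7 serving."""
    hours_per_month = 24 * 30
    self_host = gpu_hourly_usd * num_gpus * hours_per_month + ops_monthly_usd
    return self_host, ("self-host" if self_host < api_monthly_usd else "API")

# Under these assumptions self-hosting costs ~$5,880/month, so a $1,000
# API bill favors the API and a $10,000 bill favors self-hosting,
# consistent with the rough thresholds above.
cost, choice = self_hosting_breakeven(api_monthly_usd=10_000)
```

The fixed ops cost is what makes small workloads uneconomical to self-host: it dominates until API spend grows past it.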
