Infrastructure

Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software for running trained AI models in production: handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks like vLLM, TGI (Text Generation Inference), and TensorRT-LLM handle the complex engineering of making LLM inference fast and cost-efficient at scale.

Why It Matters

The gap between "I have a model" and "I can serve 10,000 concurrent users" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching: problems that are very hard to get right from scratch. Picking the right serving stack is one of the highest-leverage decisions in production AI.

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention — managing KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
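A minimal sketch of what this looks like in practice, using vLLM's offline Python API. The model name and sampling settings here are illustrative assumptions; a production deployment would typically run vLLM's OpenAI-compatible HTTP server instead of calling the library directly.

```python
# Minimal vLLM offline inference sketch. The model name is illustrative;
# any Hugging Face causal LM that vLLM supports works the same way.
from vllm import LLM, SamplingParams

# vLLM allocates the KV cache in fixed-size blocks (PagedAttention),
# letting many concurrent sequences share GPU memory without fragmentation.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "Why does KV cache fragmentation waste GPU memory?",
]

# generate() batches all prompts internally and returns one
# RequestOutput per prompt.
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text)
```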

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
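The router idea is straightforward to sketch. In the hypothetical example below, both endpoint URLs, the length threshold, and the length heuristic itself are assumptions for illustration; real routers usually score requests with a trained classifier or a cost model, and both endpoints are assumed to expose the OpenAI-compatible chat API that vLLM and TGI can serve.

```python
# Hypothetical cost-aware router: short/simple requests go to a small
# model, long/complex ones to a large model. URLs and threshold are
# made-up placeholders, not real services.
import requests

SMALL_MODEL_URL = "http://small-model:8000/v1/chat/completions"  # assumption
LARGE_MODEL_URL = "http://large-model:8000/v1/chat/completions"  # assumption
COMPLEXITY_THRESHOLD = 500  # characters; a real router would use a classifier

def route(prompt: str) -> str:
    """Pick an endpoint with a crude prompt-length heuristic."""
    url = LARGE_MODEL_URL if len(prompt) > COMPLEXITY_THRESHOLD else SMALL_MODEL_URL
    resp = requests.post(
        url,
        json={"model": "default",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    # Standard OpenAI-compatible response shape.
    return resp.json()["choices"][0]["message"]["content"]
```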

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
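That break-even region can be estimated with back-of-the-envelope arithmetic. Every number in the sketch below (API price, GPU rental rate, per-GPU throughput) is an illustrative assumption, not a quote; the key structural point it captures is that self-hosting is a step function of reserved capacity, so idle GPUs make it expensive at low volume.

```python
# Back-of-the-envelope break-even estimate for self-hosting vs. API.
# Every number is an illustrative assumption; substitute real quotes.
import math

API_PRICE_PER_M_TOKENS = 0.50    # USD per million tokens (assumed)
GPU_RENTAL_PER_HOUR = 2.00       # USD per always-on GPU-hour (assumed)
GPU_TOKENS_PER_SECOND = 2_500    # aggregate batched throughput (assumed)
HOURS_PER_MONTH = 730

def monthly_api_cost(tokens_per_month: float) -> float:
    return tokens_per_month / 1e6 * API_PRICE_PER_M_TOKENS

def monthly_self_host_cost(tokens_per_month: float) -> float:
    # Self-hosting pays for reserved GPUs whether or not they are busy,
    # so cost steps up with each GPU of required capacity.
    capacity_per_gpu = GPU_TOKENS_PER_SECOND * HOURS_PER_MONTH * 3600
    gpus_needed = max(1, math.ceil(tokens_per_month / capacity_per_gpu))
    return gpus_needed * GPU_RENTAL_PER_HOUR * HOURS_PER_MONTH

for tokens in (1e9, 5e9, 50e9):  # tokens served per month
    print(f"{tokens / 1e9:>4.0f}B tok/mo:"
          f" API ${monthly_api_cost(tokens):>9,.0f}"
          f"  self-host ${monthly_self_host_cost(tokens):>9,.0f}")
```

Under these assumed numbers, the API wins at 1B tokens/month and self-hosting wins well before 50B, which is consistent with the rough dollar thresholds above; the crossover moves with your actual prices and achievable utilization.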
