Infrastructure

Model Serving

vLLM, TGI, TensorRT-LLM, Inference Server
The infrastructure and software for running trained AI models in production: handling incoming requests, managing GPU memory, batching for efficiency, and returning responses. Model serving frameworks like vLLM, TGI (Text Generation Inference), and TensorRT-LLM handle the complex engineering that makes LLM inference fast and cost-efficient at scale.
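
As a concrete starting point, the sketch below uses vLLM's offline Python API to show what a serving engine does at the smallest scale: accept prompts, batch them, and return generated text. The model id is only an example; any compatible Hugging Face causal LM would work.

```python
# Minimal vLLM offline-inference sketch (assumes `pip install vllm` and a
# CUDA GPU; the model id is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # loads weights, reserves KV cache
params = SamplingParams(temperature=0.7, max_tokens=128)

# The engine batches these prompts internally for throughput.
prompts = [
    "Explain continuous batching in one sentence.",
    "What does a model server do?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```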

Why It Matters

The gap between "I have a model" and "I can serve 10,000 concurrent users" is enormous. Model serving frameworks solve GPU memory management, request scheduling, KV cache optimization, and continuous batching: problems that are hard to solve from scratch. Choosing the right serving stack is one of the highest-leverage decisions in production AI.
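
To make "continuous batching" concrete, here is a deliberately simplified toy loop; the names are invented and this is not any framework's actual scheduler. A static batch would wait for every sequence to finish before admitting new work, while continuous batching refills free batch slots at every decode step.

```python
# Toy continuous-batching loop (invented names; real schedulers in vLLM/TGI
# are far more sophisticated).
from collections import deque

MAX_BATCH = 4

def decode_step(seq: dict) -> bool:
    """Stand-in for one forward pass producing one token; True when finished."""
    seq["generated"] += 1
    return seq["generated"] >= seq["max_tokens"]

waiting = deque({"id": i, "generated": 0, "max_tokens": 2 + i} for i in range(8))
running: list[dict] = []

while waiting or running:
    # Admit queued requests into free slots (the "continuous" part).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # One decode step for the whole batch; evict finished sequences at once.
    running = [seq for seq in running if not decode_step(seq)]
```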

Deep Dive

vLLM (UC Berkeley) introduced PagedAttention, which manages the KV cache like virtual memory pages to eliminate fragmentation, achieving 2–4x higher throughput than naive implementations. TGI (Hugging Face) provides a production-ready server with built-in support for many model architectures, quantization, and streaming. TensorRT-LLM (NVIDIA) optimizes models specifically for NVIDIA GPUs using kernel fusion and custom CUDA kernels, often achieving the best single-GPU performance.
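
The virtual-memory analogy can be made concrete with a toy block table. This is a sketch of the idea only, not vLLM's actual data structures; BLOCK_SIZE and the pool size are arbitrary.

```python
# Toy sketch of PagedAttention's block-table idea (not vLLM's code).
# The KV cache is carved into fixed-size physical blocks; each sequence keeps
# a table mapping logical token positions to physical blocks, so blocks need
# not be contiguous and memory fragmentation disappears.
BLOCK_SIZE = 16                              # tokens per physical KV block (arbitrary)

free_blocks = list(range(1024))              # pool of physical block ids on the GPU
block_tables: dict[int, list[int]] = {}      # seq_id -> that sequence's block table

def append_token(seq_id: int, tokens_so_far: int) -> None:
    """Allocate a new physical block only when a sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if tokens_so_far % BLOCK_SIZE == 0:      # current block full (or first token)
        table.append(free_blocks.pop())      # any free block will do; no contiguity needed

def free_sequence(seq_id: int) -> None:
    """Return a finished sequence's blocks to the pool for immediate reuse."""
    free_blocks.extend(block_tables.pop(seq_id, []))
```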

The Serving Stack

A production serving deployment typically includes: a model server (vLLM/TGI), a reverse proxy for load balancing (nginx), a request queue for traffic spikes, monitoring for latency and throughput metrics, and auto-scaling to add or remove GPU instances based on demand. Some deployments add a router that directs simple requests to smaller models and complex requests to larger ones, optimizing cost.
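
A minimal version of such a router might look like the sketch below. The endpoint URLs, placeholder model name, and length heuristic are assumptions for illustration; the backends are assumed to expose an OpenAI-compatible /v1/completions endpoint, as vLLM can.

```python
# Hypothetical cost-aware router (all names here are illustrative).
import requests

SMALL_MODEL_URL = "http://small-model:8000/v1/completions"   # e.g. a 7B model
LARGE_MODEL_URL = "http://large-model:8000/v1/completions"   # e.g. a 70B model

def route(prompt: str) -> str:
    # Naive heuristic: short prompts go to the cheap model. Real routers use
    # classifiers, task labels, or user tiers instead of prompt length.
    url = SMALL_MODEL_URL if len(prompt) < 500 else LARGE_MODEL_URL
    resp = requests.post(
        url,
        json={"model": "served-model", "prompt": prompt, "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```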

Self-Hosting vs. API

The decision between self-hosting (running your own model server) and using a provider's API depends on scale, privacy, and cost. Below ~$1,000/month in API costs, self-hosting rarely makes economic sense (GPU rental is expensive). Above ~$10,000/month, self-hosting often wins because you can optimize for your specific workload. Privacy requirements (data can't leave your infrastructure) often force self-hosting regardless of cost.
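
A back-of-envelope calculation shows where thresholds like these come from. All prices below are illustrative assumptions, not current quotes; plug in your own numbers.

```python
# Back-of-envelope break-even estimate (all prices are assumptions).
API_COST_PER_M_TOKENS = 2.00     # assumed blended API price, $ per 1M tokens
GPU_HOURLY_RATE = 2.50           # assumed rental price, $ per GPU-hour
HOURS_PER_MONTH = 730

gpu_monthly_cost = GPU_HOURLY_RATE * HOURS_PER_MONTH              # $1,825/month
break_even_tokens = gpu_monthly_cost / API_COST_PER_M_TOKENS * 1e6

print(f"One rented GPU: ${gpu_monthly_cost:,.0f}/month")
print(f"Break-even API volume: {break_even_tokens / 1e9:.2f}B tokens/month")
# Under these assumptions, roughly 0.9B API tokens/month already costs as much
# as one GPU; the catch is that self-hosting only wins if you can keep that
# GPU well utilized, which is what the $1k/$10k rules of thumb approximate.
```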
