
vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the hard engineering of GPU memory management, request scheduling, and KV cache optimization, and exposes an OpenAI-compatible API, making it easy to self-host open-source models (Llama, Mistral, Qwen) in production.

Why It Matters

vLLM is the most popular open-source LLM serving solution. If you are self-hosting open-source models, you are probably using vLLM (or should be). Its PagedAttention innovation delivers 2–24x higher serving throughput than naive implementations. It is the infrastructure layer that makes open-source models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving. Beyond PagedAttention, vLLM implements: continuous batching (adding new requests to a running batch without waiting for the current batch to finish), prefix caching (sharing KV cache for common prompt prefixes), tensor parallelism (splitting a model across multiple GPUs), and speculative decoding (using a draft model to speed up generation). These optimizations are largely independent, so their gains compound when combined.
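
Several of these features are exposed through vLLM's offline Python API as well as server flags. A minimal sketch, assuming a 4-GPU node and using the Llama-3-70B model ID from the Usage section below; PagedAttention and continuous batching are handled automatically by the engine, and exact argument names can vary between vLLM releases:

    # Sketch of vLLM's offline API with some of the optimizations above enabled.
    # Assumes 4 GPUs are available; argument names may differ across vLLM versions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3-70B",
        tensor_parallel_size=4,        # shard the model across 4 GPUs
        enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
        gpu_memory_utilization=0.90,   # fraction of VRAM given to weights + KV cache
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain continuous batching briefly."], params)
    print(outputs[0].outputs[0].text)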

Usage

Deploying a model with vLLM is straightforward. A single command starts an OpenAI-compatible server sharded across 4 GPUs:

    vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4

Applications connect using any OpenAI SDK by changing the base URL. This drop-in compatibility means you can prototype with OpenAI's API and switch to self-hosted vLLM without changing application code; only the endpoint changes.
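
A minimal client-side sketch, assuming the server started above is listening locally on port 8000 and serving the model name shown; the api_key value is a placeholder unless the server was launched with authentication enabled:

    # Point the standard OpenAI SDK at a local vLLM server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
        api_key="EMPTY",                      # placeholder; ignored unless --api-key is set
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3-70B",       # must match the model name the server reports
        messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)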

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.
