Tools

vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the hard engineering of GPU memory management, request scheduling, and KV cache optimization, and exposes an OpenAI-compatible API, making it easy to self-host open models (Llama, Mistral, Qwen) in production.

Why It Matters

vLLM is the most popular open-source LLM serving solution. If you are self-hosting open models, you are probably using vLLM (or should be). Its PagedAttention innovation improves serving throughput by 2–24× over naive implementations. It is the infrastructure layer that makes open models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving: the KV cache is stored in fixed-size blocks, analogous to pages in virtual memory, which eliminates the fragmentation of contiguous allocation and lets far more concurrent requests fit in GPU memory. Beyond PagedAttention, vLLM implements continuous batching (adding new requests to a running batch without waiting for it to finish), prefix caching (sharing KV cache across requests with a common prompt prefix), tensor parallelism (splitting a model across multiple GPUs), and speculative decoding (using a small draft model to propose tokens that the target model verifies). These optimizations compose, delivering multiplicative speedups.
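
Several of these features surface directly in the offline Python API. Below is a minimal sketch, assuming a recent vLLM release; the model name, GPU count, and prompts are illustrative, with enable_prefix_caching and tensor_parallel_size as the engine arguments for prefix caching and tensor parallelism:

```python
from vllm import LLM, SamplingParams

# Load a model sharded across 4 GPUs with prefix caching enabled.
# (Model name and sizes are placeholders; adjust to your hardware.)
llm = LLM(
    model="meta-llama/Llama-3-70B",
    tensor_parallel_size=4,        # tensor parallelism: split weights over 4 GPUs
    enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
)

# Prompts sharing a common prefix: the prefix's KV cache is computed once,
# and continuous batching schedules all requests together.
system = "You are a concise technical assistant.\n\n"
prompts = [system + q for q in [
    "Explain PagedAttention in one sentence.",
    "Explain continuous batching in one sentence.",
]]

params = SamplingParams(temperature=0.7, max_tokens=64)
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```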

Usage

Deploying a model with vLLM is straightforward: vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible server sharded across 4 GPUs. Applications connect using any OpenAI SDK by changing the base URL. This drop-in compatibility means you can prototype against OpenAI's API and later switch to self-hosted vLLM without changing application code; only the endpoint changes.
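
A minimal client-side sketch, assuming a vLLM server running locally on its default port (8000); the model name and prompt are placeholders:

```python
from openai import OpenAI

# Same SDK you would use against api.openai.com; only base_url changes.
# vLLM accepts any api_key string unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3-70B",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```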

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.
