Tools

vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the complex engineering of GPU memory management, request scheduling, and KV cache optimization, exposing an OpenAI-compatible API that makes it easy to self-host open models (Llama, Mistral, Qwen) in production.

Why It Matters

vLLM is the most popular open-source LLM serving solution. If you are self-hosting an open model, you are probably using vLLM (or you should be). Its PagedAttention innovation increased serving throughput by 2–24x over naive implementations. It is the infrastructure layer that makes open models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving: the KV cache is managed in fixed-size blocks, analogous to virtual-memory pages, which eliminates the fragmentation of contiguous allocation and lets cache blocks be shared across requests. Beyond PagedAttention, vLLM implements continuous batching (adding new requests to a running batch without waiting for it to drain), prefix caching (sharing KV cache for common prompt prefixes), tensor parallelism (splitting a model across multiple GPUs), and speculative decoding (using a small draft model to speed up generation). These optimizations compose, delivering multiplicative speedups.
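
A minimal sketch of turning some of these features on through vLLM's offline Python API (one entry point among several; the model name is taken from the Usage example below, and tensor_parallel_size assumes 4 GPUs are available):

    # Assumes vLLM is installed (pip install vllm) and 4 GPUs are visible.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3-70B",  # any Hugging Face model vLLM supports
        tensor_parallel_size=4,          # shard the weights across 4 GPUs
        enable_prefix_caching=True,      # reuse KV cache for shared prompt prefixes
    )

    # These two prompts share a prefix, so prefix caching reuses its KV cache;
    # continuous batching schedules both requests together automatically.
    prompts = [
        "You are a helpful assistant. Summarize the history of GPUs.",
        "You are a helpful assistant. Summarize the history of TPUs.",
    ]
    params = SamplingParams(temperature=0.7, max_tokens=256)

    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)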

Usage

Deploying a model with vLLM is straightforward: vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible server sharded across 4 GPUs. Applications connect using any OpenAI SDK by changing the base URL, so you can prototype against OpenAI's API and switch to self-hosted vLLM without changing application code; only the endpoint changes.
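
For example, a minimal client pointed at that server (a sketch: the port assumes vLLM's default of 8000, and the placeholder api_key works because the command above starts the server without --api-key):

    # Assumes the openai package is installed and the vLLM server above is running.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
        api_key="EMPTY",                      # ignored unless the server sets --api-key
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3-70B",  # must match the served model name
        messages=[{"role": "user",
                   "content": "Explain PagedAttention in one sentence."}],
    )
    print(response.choices[0].message.content)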

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.
