Tools

vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the complex engineering of GPU memory management, request scheduling, and KV-cache optimization, while exposing an OpenAI-compatible API that makes self-hosting open models (Llama, Mistral, Qwen) in production straightforward.

Why it matters

vLLM is the most popular open-source LLM serving solution. If you are self-hosting an open model, you are probably using vLLM (or should be). Its PagedAttention innovation increased serving throughput by 2–24x compared to naive implementations. It is the infrastructure layer that makes open models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving: the KV cache is stored in fixed-size blocks and each sequence keeps a block table mapping logical positions to physical blocks, much like OS virtual memory paging, which largely eliminates memory fragmentation. Beyond PagedAttention, vLLM implements continuous batching (adding new requests to running batches without waiting), prefix caching (sharing KV cache for common prompt prefixes), tensor parallelism (splitting models across multiple GPUs), and speculative decoding (using a draft model to speed up generation). These optimizations compose, delivering multiplicative speedups.
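The paged-allocation idea can be sketched in a few lines. This is a toy illustration of the block-table concept, not vLLM's actual implementation; the class names and block size are illustrative assumptions.

```python
# Toy sketch of PagedAttention's core idea: the KV cache lives in
# fixed-size physical blocks, and each sequence keeps a "block table"
# mapping its logical token positions to physical blocks, analogous
# to OS virtual-memory paging. Names and sizes here are illustrative.

BLOCK_SIZE = 16  # tokens stored per KV-cache block


class BlockAllocator:
    """Pool of physical KV-cache blocks shared by all sequences."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free physical block ids

    def alloc(self) -> int:
        return self.free.pop()


class Sequence:
    """One request's logical-to-physical block mapping."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical index -> physical id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are wasted per sequence --
        # versus naive serving, which reserves the whole max length up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # generate 40 tokens
    seq.append_token()

print(len(seq.block_table))  # 40 tokens fit in ceil(40/16) = 3 blocks
```

Because blocks are allocated on demand and can be shared between sequences, memory that naive serving would reserve for worst-case sequence lengths is freed to batch more requests, which is where the throughput gain comes from.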

Usage

Deploying a model with vLLM is straightforward: vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible server on 4 GPUs. Applications connect using any OpenAI SDK by changing the base URL. This drop-in compatibility means you can prototype with OpenAI's API and switch to self-hosted vLLM without changing application code — just change the endpoint.
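The flow above can be sketched as two commands. The `vllm serve` invocation is the one from the text; the port and request body assume vLLM's defaults (an OpenAI-compatible server on port 8000):

```shell
# Start an OpenAI-compatible server, sharding the model across 4 GPUs.
vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4

# Any OpenAI client works by pointing at the local endpoint; the chat
# completions route mirrors OpenAI's API (default port 8000 assumed).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-70B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

In application code, the same switch is just a different `base_url` passed to the OpenAI SDK client.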

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.

Related concepts
