Tools

vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the complex engineering of GPU memory management, request scheduling, and KV cache optimization, exposing an OpenAI-compatible API that makes it easy to self-host open models (Llama, Mistral, Qwen) in production.

Why It Matters

vLLM is the most popular open-source LLM serving solution. If you self-host an open model, you are probably using vLLM (or you should be). Its PagedAttention innovation raised serving throughput by 2–24x over naive implementations. It is the infrastructure layer that makes open models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention, which manages the KV cache in fixed-size blocks the way an operating system pages virtual memory, nearly eliminating the fragmentation that wastes GPU memory in naive serving. Beyond PagedAttention, vLLM implements: continuous batching (admitting new requests into a running batch instead of waiting for it to finish), prefix caching (sharing KV cache entries across requests with a common prompt prefix), tensor parallelism (splitting a model across multiple GPUs), and speculative decoding (using a small draft model to speed up generation). These optimizations compose, delivering multiplicative speedups.
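
A minimal sketch of turning several of these features on through vLLM's offline Python API; the model id, GPU count, and sampling settings are placeholders, and exact engine-argument names can shift between vLLM releases:

    from vllm import LLM, SamplingParams

    # Shard the model across 4 GPUs with tensor parallelism and enable
    # automatic prefix caching so shared prompt prefixes reuse KV cache.
    llm = LLM(
        model="meta-llama/Llama-3-70B",  # placeholder model id
        tensor_parallel_size=4,          # split weights across 4 GPUs
        enable_prefix_caching=True,      # reuse KV blocks for common prefixes
    )

    # Continuous batching is automatic: the engine schedules these
    # prompts together and admits new requests as earlier ones finish.
    prompts = [
        "Summarize PagedAttention in one sentence.",
        "Explain continuous batching briefly.",
    ]
    params = SamplingParams(temperature=0.7, max_tokens=128)

    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)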

Usage

Deploying a model with vLLM is straightforward: vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible server on 4 GPUs. Applications connect using any OpenAI SDK by changing the base URL. This drop-in compatibility means you can prototype with OpenAI's API and switch to self-hosted vLLM without changing application code — just change the endpoint.
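
A sketch of the client side, assuming the server started above and vLLM's defaults (port 8000, routes under /v1, no API key unless the server is launched with --api-key):

    from openai import OpenAI

    # The same SDK used against api.openai.com; only the base URL changes.
    client = OpenAI(
        base_url="http://localhost:8000/v1",  # default vLLM server address
        api_key="EMPTY",  # placeholder; ignored unless --api-key is set
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3-70B",  # must match the served model
        messages=[{"role": "user", "content": "Hello from self-hosted vLLM!"}],
    )
    print(response.choices[0].message.content)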

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.
