
vLLM

An open-source LLM serving engine that achieves high throughput through PagedAttention and continuous batching. vLLM handles the complex engineering of GPU memory management, request scheduling, and KV cache optimization while exposing an OpenAI-compatible API, which makes open models (Llama, Mistral, Qwen) easy to self-host in production.

Why it matters

vLLM is the most popular open-source LLM serving solution. If you are self-hosting an open model, you are probably using vLLM (or should be). Its PagedAttention innovation raised serving throughput by 2–24x over naive implementations. It is the infrastructure layer that makes open models practical for production use.

Deep Dive

vLLM (Kwon et al., UC Berkeley, 2023) introduced PagedAttention to LLM serving. PagedAttention borrows virtual-memory paging from operating systems: instead of reserving one contiguous KV-cache region per request, the cache is stored in small fixed-size blocks allocated on demand, which nearly eliminates fragmentation and lets far more concurrent requests fit on a GPU. Beyond PagedAttention, vLLM implements: continuous batching (adding new requests to running batches without waiting), prefix caching (sharing KV cache for common prompt prefixes), tensor parallelism (splitting models across multiple GPUs), and speculative decoding (using a draft model to speed up generation). These optimizations compose, delivering multiplicative speedups.
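To make the paging idea concrete, here is a minimal illustrative sketch in Python. It is not vLLM's actual implementation; the BlockTable class and its fields are hypothetical, though the block size of 16 tokens matches vLLM's default.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockTable:
    """Maps a sequence's logical token positions to physical KV-cache blocks."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.blocks: list[int] = []      # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full, so
        # memory grows with the sequence instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        # Logical token position -> (physical block id, offset within block).
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

# Example: a 20-token sequence occupies 2 blocks rather than a worst-case
# allocation sized for the model's maximum context length.
pool = list(range(100))
seq = BlockTable(pool)
for _ in range(20):
    seq.append_token()
print(seq.physical_slot(17))  # (id of the sequence's second block, offset 1)

Prefix caching falls out of the same structure: two sequences that share a prompt prefix can point their block tables at the same physical blocks.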

Usage

Deploying a model with vLLM is straightforward: vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4 starts an OpenAI-compatible server on 4 GPUs. Applications connect using any OpenAI SDK by changing the base URL. This drop-in compatibility means you can prototype with OpenAI's API and switch to self-hosted vLLM without changing application code — just change the endpoint.
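For example, here is a minimal client-side sketch using the official OpenAI Python SDK against that local server. The URL and placeholder key reflect vLLM's defaults (port 8000, no authentication unless the server is started with --api-key):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                 # placeholder; vLLM ignores it unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B",  # must match the model passed to vllm serve
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)

Switching back to OpenAI's hosted API is the reverse change: restore the default base_url and supply a real API key.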

vLLM vs. Alternatives

TGI (Hugging Face) offers similar features with tighter Hugging Face ecosystem integration. TensorRT-LLM (NVIDIA) uses custom CUDA kernels for maximum single-GPU performance but requires NVIDIA hardware. SGLang (Berkeley) focuses on structured generation and complex prompting patterns. For most self-hosting scenarios, vLLM is the default choice due to its performance, broad model support, and active community. For maximum throughput on NVIDIA hardware specifically, TensorRT-LLM may edge it out.
