Infrastructure

PagedAttention

A memory management technique for the KV cache that borrows from operating system virtual memory. Instead of allocating one contiguous block of GPU memory for each request's KV cache (which wastes memory through fragmentation), PagedAttention stores the cache in non-contiguous blocks ("pages") that are allocated on demand and can be shared across requests with common prefixes.

Why it matters

PagedAttention is the innovation behind vLLM and has since been adopted by most LLM serving frameworks. By eliminating memory waste from fragmentation, it increased serving throughput 2–4x over naive implementations. Without it, serving long-context models to many concurrent users would be dramatically more expensive.

Deep Dive

The problem PagedAttention solves: when a request arrives, you don't know how long the response will be, so you must pre-allocate KV cache for the maximum possible length. If max length is 4096 tokens but the response is only 200 tokens, 95% of the allocated memory is wasted. Multiply by hundreds of concurrent requests and GPU memory fills up fast, limiting throughput.
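The waste described above is easy to quantify. A minimal sketch, using hypothetical model dimensions (32 layers, 32 heads, head dimension 128, fp16) to work through the 4096-vs-200-token example:

```python
# Illustrative arithmetic with assumed (hypothetical) model sizes: memory
# wasted by pre-allocating KV cache to max length vs. the actual response.

def kv_cache_bytes(num_tokens, num_layers=32, num_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Per-token cost: 2 (K and V) * layers * heads * head_dim * dtype bytes."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem * num_tokens

max_len, actual_len = 4096, 200
allocated = kv_cache_bytes(max_len)
used = kv_cache_bytes(actual_len)
waste_pct = 100 * (allocated - used) / allocated
print(f"allocated {allocated / 2**20:.0f} MiB, used {used / 2**20:.0f} MiB, "
      f"{waste_pct:.0f}% wasted")
# prints: allocated 2048 MiB, used 100 MiB, 95% wasted
```

Two gigabytes reserved, 100 MiB used: at these (assumed) sizes, a single short request strands nearly 2 GiB of GPU memory that no other request can touch.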

The Virtual Memory Analogy

PagedAttention divides KV cache into fixed-size pages (e.g., 16 tokens per page). Pages are allocated only when needed and can be stored anywhere in GPU memory (non-contiguous). A page table maps logical positions to physical memory locations, just like OS virtual memory. This eliminates fragmentation: memory is allocated page-by-page as the response grows, and freed pages are immediately available for new requests.
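The mechanics above can be sketched in a few lines. This is not vLLM's actual code, just a toy allocator under the stated assumptions: a free list of fixed-size physical blocks and a per-request page table mapping logical page index to physical block id.

```python
# Toy paged KV-cache allocator (illustrative, not vLLM's implementation).

PAGE_SIZE = 16  # tokens per page

class PagedAllocator:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.page_tables = {}  # request_id -> list of physical block ids

    def append_token(self, request_id, token_index):
        """Allocate a new physical page only when a page boundary is crossed."""
        table = self.page_tables.setdefault(request_id, [])
        logical_page = token_index // PAGE_SIZE
        if logical_page == len(table):            # first token of a new page
            table.append(self.free_blocks.pop())  # any free block, non-contiguous
        return table[logical_page]                # physical block for this token

    def free_request(self, request_id):
        """Return a finished request's pages to the pool immediately."""
        self.free_blocks.extend(self.page_tables.pop(request_id))

alloc = PagedAllocator(num_physical_blocks=8)
for t in range(40):                   # 40 tokens -> 3 pages of 16
    alloc.append_token("req-1", t)
print(alloc.page_tables["req-1"])     # three block ids, scattered in memory
alloc.free_request("req-1")           # freed pages are reusable at once
```

The key property: memory grows one page at a time with the response, so the worst-case waste per request is less than one page (under 16 tokens' worth) rather than thousands of tokens.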

Prefix Sharing

A powerful extension: when multiple requests share the same prompt prefix (common with shared system prompts), their KV cache pages for that prefix can be physically shared — stored once in memory but referenced by all requests. This is copy-on-write semantics from OS design applied to LLM serving. For applications where many users share the same system prompt, this can reduce memory usage by 50%+ for the shared portion.
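The sharing idea can be sketched with reference counts, the same bookkeeping copy-on-write uses. This is a hedged illustration, not vLLM's real data structures: pages holding a shared prompt prefix are stored once, and each request's page table points at the same physical blocks.

```python
# Toy prefix-sharing cache with reference counting (illustrative only).

class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0  # how many requests reference this physical page

class PrefixSharingCache:
    def __init__(self):
        self.prefix_index = {}  # prefix token tuple -> list of shared Blocks
        self.next_id = 0

    def _new_block(self):
        block = Block(self.next_id)
        self.next_id += 1
        return block

    def get_pages_for_prefix(self, prefix_tokens, page_size=16):
        """Reuse cached pages for an identical prefix; allocate on first sight."""
        key = tuple(prefix_tokens)
        if key not in self.prefix_index:
            num_pages = -(-len(prefix_tokens) // page_size)  # ceil division
            self.prefix_index[key] = [self._new_block() for _ in range(num_pages)]
        blocks = self.prefix_index[key]
        for block in blocks:
            block.ref_count += 1  # one more request shares these pages
        return blocks

cache = PrefixSharingCache()
system_prompt = list(range(48))          # 48-token shared prefix -> 3 pages
a = cache.get_pages_for_prefix(system_prompt)
b = cache.get_pages_for_prefix(system_prompt)
assert a is b and a[0].ref_count == 2    # stored once, referenced twice
```

In a full implementation, a request that diverges from the prefix (e.g., appends its own tokens into a shared page) would first copy that page and decrement the reference count, which is where the "copy-on-write" name comes from.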
