Infrastructure

PagedAttention

A KV cache memory-management technique borrowed from operating-system virtual memory. Instead of allocating one contiguous block of GPU memory for each request's KV cache (which wastes memory through fragmentation), PagedAttention stores the cache in non-contiguous blocks ("pages") that are allocated on demand and can be shared between requests with a common prefix.

Why It Matters

PagedAttention is the innovation behind vLLM and has since been adopted by most LLM serving frameworks. By eliminating memory waste from fragmentation, it improves serving throughput by 2–4x over naive implementations. Without it, serving long-context models to many concurrent users would be far more expensive.

Deep Dive

The problem PagedAttention solves: when a request arrives, you don't know how long the response will be, so a naive contiguous allocator must reserve KV cache for the maximum possible length up front. If the max length is 4096 tokens but the response is only 200 tokens, roughly 95% of the reserved memory is wasted. Multiply that by hundreds of concurrent requests and GPU memory fills up fast, limiting batch size and therefore throughput.
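A back-of-envelope sketch of the waste, using hypothetical model dimensions (32 layers, 32 KV heads, head dim 128, fp16 — these numbers are illustrative assumptions, not from the original text):

```python
# KV cache bytes per token = 2 (K and V) * layers * heads * head_dim * 2 bytes (fp16)
LAYERS, HEADS, HEAD_DIM, FP16_BYTES = 32, 32, 128, 2
bytes_per_token = 2 * LAYERS * HEADS * HEAD_DIM * FP16_BYTES  # 0.5 MiB per token

max_len, actual_len = 4096, 200
reserved = max_len * bytes_per_token   # pre-allocated for the worst case
used = actual_len * bytes_per_token    # actually needed for this response
waste = 1 - used / reserved
print(f"reserved {reserved / 2**30:.1f} GiB per request, wasted {waste:.0%}")
# → reserved 2.0 GiB per request, wasted 95%
```

At 2 GiB reserved per request, an 80 GiB GPU saturates after a few dozen concurrent requests even though most of that memory holds nothing.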

The Virtual Memory Analogy

PagedAttention divides KV cache into fixed-size pages (e.g., 16 tokens per page). Pages are allocated only when needed and can be stored anywhere in GPU memory (non-contiguous). A page table maps logical positions to physical memory locations, just like OS virtual memory. This eliminates fragmentation: memory is allocated page-by-page as the response grows, and freed pages are immediately available for new requests.
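The mechanism can be sketched as a pool of physical blocks plus a per-request page table. This is a minimal illustrative model, not vLLM's actual implementation; all class and method names here are made up:

```python
PAGE_SIZE = 16  # tokens per page (a typical block size)

class BlockAllocator:
    """Pool of fixed-size physical blocks anywhere in GPU memory."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)  # immediately reusable by other requests

class Request:
    def __init__(self, allocator):
        self.allocator = allocator
        self.page_table = []  # logical page index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical page only when the current one fills up,
        # so memory grows with the actual response, not the maximum length.
        if self.num_tokens % PAGE_SIZE == 0:
            self.page_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def finish(self):
        for block_id in self.page_table:
            self.allocator.release(block_id)
        self.page_table = []
```

During the attention computation, the kernel follows the page table to gather each page's K/V entries, exactly as a CPU's MMU follows the OS page table to resolve virtual addresses.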

Prefix Sharing

A powerful extension: when multiple requests share the same prompt prefix (common with shared system prompts), their KV cache pages for that prefix can be physically shared — stored once in memory but referenced by all requests. This is copy-on-write semantics from OS design applied to LLM serving. For applications where many users share the same system prompt, this can reduce memory usage by 50%+ for the shared portion.
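Copy-on-write sharing can be sketched with reference counts on physical pages (again an assumed design for illustration; the names are hypothetical): forking a request bumps refcounts instead of copying, and a page is only duplicated when a request that shares it needs to write.

```python
from collections import defaultdict

class SharedPagePool:
    def __init__(self):
        self.refcount = defaultdict(int)  # physical page id -> #requests using it
        self.next_id = 0

    def new_page(self):
        pid, self.next_id = self.next_id, self.next_id + 1
        self.refcount[pid] = 1
        return pid

    def fork(self, page_table):
        # New request with the same prefix: share pages, bump refcounts, no copy.
        for pid in page_table:
            self.refcount[pid] += 1
        return list(page_table)

    def copy_on_write(self, pid):
        # Duplicate a page only if another request still references it.
        if self.refcount[pid] == 1:
            return pid
        self.refcount[pid] -= 1
        return self.new_page()

pool = SharedPagePool()
req_a = [pool.new_page(), pool.new_page()]  # first request fills the prefix pages
req_b = pool.fork(req_a)                    # same system prompt: shared, zero copies
req_b[-1] = pool.copy_on_write(req_b[-1])   # req_b diverges: private copy of last page
```

The prefix is stored once no matter how many requests reference it, which is where the 50%+ savings for shared system prompts comes from.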
