
PagedAttention

A KV-cache memory-management technique inspired by operating-system virtual memory. Instead of allocating one contiguous block of GPU memory for each request's KV cache (which wastes memory through fragmentation), PagedAttention stores the cache in non-contiguous blocks ("pages") that are allocated on demand and can be shared between requests with a common prefix.

Why It Matters

PagedAttention is the innovation behind vLLM and has since been adopted by most LLM serving frameworks. By eliminating fragmentation-related memory waste, it improves serving throughput by 2–4× over naive implementations. Without it, serving long-context models to many concurrent users would be far more expensive.

Deep Dive

The problem PagedAttention solves: when a request arrives, you don't know how long the response will be, so you must pre-allocate KV cache for the maximum possible length. If max length is 4096 tokens but the response is only 200 tokens, 95% of the allocated memory is wasted. Multiply by hundreds of concurrent requests and GPU memory fills up fast, limiting throughput.
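The waste described above is easy to quantify. A minimal back-of-envelope sketch, assuming illustrative Llama-7B-like shapes (32 layers, 32 KV heads, head dimension 128, fp16); the numbers are for intuition, not any specific deployment:

```python
# Back-of-envelope KV-cache sizing. Model shapes are assumptions
# (Llama-7B-like): 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 32, 128, 2

def kv_bytes(tokens: int) -> int:
    # Each token stores a K and a V vector per layer, per KV head.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * tokens

max_alloc = kv_bytes(4096)   # pre-allocated for the worst case
used      = kv_bytes(200)    # actually needed by a 200-token response
print(f"pre-allocated: {max_alloc / 2**30:.1f} GiB")        # → 2.0 GiB
print(f"wasted: {100 * (1 - used / max_alloc):.1f}%")       # → 95.1%
```

Under these assumptions a single request reserves ~2 GiB up front but uses only ~5% of it, which is why a few dozen concurrent requests can exhaust an 80 GiB GPU.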

The Virtual Memory Analogy

PagedAttention divides KV cache into fixed-size pages (e.g., 16 tokens per page). Pages are allocated only when needed and can be stored anywhere in GPU memory (non-contiguous). A page table maps logical positions to physical memory locations, just like OS virtual memory. This eliminates fragmentation: memory is allocated page-by-page as the response grows, and freed pages are immediately available for new requests.
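The bookkeeping above can be sketched in a few lines. This is an illustrative model of the allocator and page table only, not vLLM's actual API and not the attention kernel itself; all class and method names are hypothetical:

```python
# Sketch of paged KV-cache bookkeeping. Names (BlockAllocator, Sequence)
# are illustrative, not vLLM's real API.
PAGE_SIZE = 16  # tokens per page

class BlockAllocator:
    """Free list over physical page ids in GPU memory."""
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, page: int) -> None:
        self.free.append(page)

class Sequence:
    """One request's page table: logical page index -> physical page id."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.page_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical page only when the current one is full.
        if self.num_tokens % PAGE_SIZE == 0:
            self.page_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free(self) -> None:
        # Freed pages are immediately reusable by other requests.
        for page in self.page_table:
            self.allocator.release(page)
        self.page_table.clear()
        self.num_tokens = 0

alloc = BlockAllocator(num_pages=1024)
seq = Sequence(alloc)
for _ in range(40):            # a 40-token response needs ceil(40/16) pages
    seq.append_token()
print(len(seq.page_table))     # → 3
```

Memory grows one page at a time as the response is generated, so the per-request waste is bounded by a single partially filled page (at most 15 tokens here) instead of the thousands wasted by worst-case pre-allocation.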

Prefix Sharing

A powerful extension: when multiple requests share the same prompt prefix (common with shared system prompts), their KV cache pages for that prefix can be physically shared — stored once in memory but referenced by all requests. This is copy-on-write semantics from OS design applied to LLM serving. For applications where many users share the same system prompt, this can reduce memory usage by 50%+ for the shared portion.
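The copy-on-write behavior can be sketched with reference counting: shared prefix pages are stored once with a count per physical page, and a write to a shared page first detaches onto a private copy. Again an illustrative sketch with hypothetical names, not vLLM's real implementation:

```python
# Illustrative copy-on-write page sharing via reference counts.
# CowAllocator and its methods are hypothetical names.
class CowAllocator:
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.refs: dict[int, int] = {}   # physical page -> reference count

    def alloc(self) -> int:
        page = self.free.pop()
        self.refs[page] = 1
        return page

    def share(self, page: int) -> None:
        # A second sequence references the same physical page for free.
        self.refs[page] += 1

    def write(self, page: int) -> int:
        """Return a page safe to mutate: copy first if it is shared."""
        if self.refs[page] == 1:
            return page                  # sole owner, write in place
        self.refs[page] -= 1             # detach from the shared copy
        return self.alloc()              # (a real system also copies the data)

alloc = CowAllocator(num_pages=8)
prefix_page = alloc.alloc()       # KV page of a shared system prompt
alloc.share(prefix_page)          # second request reuses it, zero extra memory
target = alloc.write(prefix_page) # diverging write triggers a private copy
print(target != prefix_page)      # → True
```

As long as requests only read the shared prefix (which is the common case during attention), every reference points at the same physical page, so N users sharing one system prompt pay for its KV cache once.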
