The problem PagedAttention solves: when a request arrives, you don't know how long the response will be, so you must pre-allocate KV cache for the maximum possible length. If max length is 4096 tokens but the response is only 200 tokens, 95% of the allocated memory is wasted. Multiply by hundreds of concurrent requests and GPU memory fills up fast, limiting throughput.
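The waste figure above is simple arithmetic; a quick sketch (numbers taken from the text) makes the scale concrete:

```python
# Pre-allocation waste from the example above: reserve for the max length,
# use only what the response actually needed.
MAX_LEN = 4096   # tokens reserved per request
ACTUAL = 200     # tokens actually generated

wasted_fraction = (MAX_LEN - ACTUAL) / MAX_LEN
print(f"wasted: {wasted_fraction:.1%}")  # ~95% of the reservation is unused
```

At hundreds of concurrent requests, that unused reservation is the difference between a batch that fits in GPU memory and one that doesn't.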
PagedAttention divides the KV cache into fixed-size pages (e.g., 16 tokens per page). Pages are allocated only when needed and can be stored anywhere in GPU memory (non-contiguous). A page table maps logical positions to physical memory locations, just like OS virtual memory. This nearly eliminates fragmentation: memory is allocated page-by-page as the response grows, waste is bounded by at most one partially filled page per sequence, and freed pages are immediately available for new requests.
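The mechanics can be sketched with a free list and a per-request page table. This is an illustrative toy, not vLLM's actual implementation; the class and method names are assumptions:

```python
PAGE_SIZE = 16  # tokens per page, as in the example above

class PagedKVCache:
    """Toy PagedAttention-style allocator: pages come from a shared free
    pool, and each request's page table maps logical pages to physical ones."""

    def __init__(self, num_physical_pages):
        self.free_pages = list(range(num_physical_pages))  # pool of physical page ids
        self.page_tables = {}  # request id -> list of physical page ids, in logical order

    def append_token(self, request_id, position):
        """Ensure a physical page backs the logical position of a new token."""
        table = self.page_tables.setdefault(request_id, [])
        if position // PAGE_SIZE >= len(table):  # logical page not yet mapped
            table.append(self.free_pages.pop())  # allocate on demand
        return table[position // PAGE_SIZE]      # physical page holding this token

    def free_request(self, request_id):
        """Return all of a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))

cache = PagedKVCache(num_physical_pages=8)
for pos in range(20):           # a 20-token response...
    cache.append_token("req-1", pos)
print(len(cache.page_tables["req-1"]))  # ...touches only 2 pages, not 256
cache.free_request("req-1")     # pages go straight back to the pool
```

The key property is that allocation tracks the *actual* response length, page by page, instead of the worst case.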
A powerful extension: when multiple requests share the same prompt prefix (common with shared system prompts), their KV cache pages for that prefix can be physically shared — stored once in memory but referenced by all requests. This is copy-on-write semantics from OS design applied to LLM serving. For applications where many users share the same system prompt, this can reduce memory usage by 50%+ for the shared portion.