PagedAttention: Definition & Meaning — AI Wiki

A memory management technique for the KV cache that borrows from operating system virtual memory. Instead of allocating one contiguous block of GPU memory for each request's KV cache (which wastes memory through fragmentation), PagedAttention stores the cache in non-contiguous blocks ("pages") that are allocated on demand and can be shared between requests with common prefixes.

Why it matters

PagedAttention is the innovation behind vLLM and is now adopted by most LLM serving frameworks. It increased serving throughput by 2–4x compared with naive implementations by eliminating memory wasted to fragmentation. Without it, serving long-context models to many concurrent users would be dramatically more expensive.

Deep Dive

The problem PagedAttention solves: when a request arrives, the server does not know how long the response will be, so a naive implementation pre-allocates a contiguous KV cache region for the maximum possible length. If the maximum length is 4096 tokens but the response is only 200 tokens, about 95% of the allocated memory is wasted. Multiplied across hundreds of concurrent requests, GPU memory fills up quickly, which limits how many requests can be batched together and therefore throughput.
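The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the 400,000-slot memory budget are hypothetical, and the "exact fit" figure is an idealized upper bound, not something a real server achieves.

```python
def wasted_fraction(max_len: int, actual_len: int) -> float:
    """Fraction of a pre-allocated KV cache region that is never used."""
    return 1.0 - actual_len / max_len


def max_concurrent(budget_slots: int, slots_per_request: int) -> int:
    """How many requests fit if each one reserves slots_per_request token slots."""
    return budget_slots // slots_per_request


# Numbers from the text: 4096-token maximum, 200-token response.
print(f"{wasted_fraction(4096, 200):.1%}")  # 95.1%

# Hypothetical budget of 400,000 token slots of KV cache memory.
budget = 400_000
print(max_concurrent(budget, 4096))  # 97 requests with max-length pre-allocation
print(max_concurrent(budget, 200))   # 2000 requests if each used only what it needs
```

The gap between the last two numbers is the headroom that on-demand allocation tries to recover.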

The Virtual Memory Analogy

PagedAttention divides the KV cache into fixed-size pages (e.g., 16 tokens per page). Pages are allocated only when needed and can be stored anywhere in GPU memory (non-contiguous). A page table maps each request's logical token positions to physical memory locations, just like OS virtual memory. This eliminates external fragmentation and limits internal waste to the unfilled part of each request's last page: memory is allocated page by page as the response grows, and freed pages are immediately available for new requests.
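The page-table mechanism can be sketched as a small bookkeeping class. This is a minimal sketch that tracks page indices only and holds no tensors; the class and method names (`PagedKVCache`, `append_token`, `lookup`, `free`) are hypothetical and are not vLLM's API.

```python
class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV cache (indices only, no tensors)."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))      # physical page ids
        self.page_tables: dict[str, list[int]] = {}   # request id -> physical pages
        self.lengths: dict[str, int] = {}             # request id -> tokens stored

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_page, offset)."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        offset = length % self.page_size
        if offset == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[-1], offset

    def lookup(self, request_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position into (physical_page, offset)."""
        table = self.page_tables[request_id]
        return table[position // self.page_size], position % self.page_size

    def free(self, request_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=16, page_size=16)
for _ in range(200):                 # a 200-token sequence
    cache.append_token("req-a")
print(len(cache.page_tables["req-a"]))  # 13 pages, i.e. 208 slots for 200 tokens
cache.free("req-a")
print(len(cache.free_pages))            # 16, all pages reusable again
```

With 16-token pages, the 200-token sequence leaves only 8 slots unused, compared with 3,896 unused slots under a 4096-token pre-allocation.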

Prefix Sharing

A powerful extension: when multiple requests share the same prompt prefix (common with shared system prompts), their KV cache pages for that prefix can be physically shared: stored once in memory but referenced by every request's page table. This applies the OS idea of shared pages with copy-on-write to LLM serving: a shared page is copied only when one request needs to write to it. For applications where many users share the same system prompt, this can reduce memory usage by 50% or more for the shared portion, since a prefix shared by N requests is stored once rather than N times.
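One way to picture prefix sharing is a page pool with reference counts. The sketch below is an assumption-laden toy: `SharedPagePool` and its methods are hypothetical names, and the step where a real system would copy the page's key/value tensors is only marked with a comment.

```python
class SharedPagePool:
    """Toy page pool with reference counts and copy-on-write (indices only)."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.ref_count: dict[int, int] = {}   # physical page -> number of users

    def allocate(self) -> int:
        if not self.free_pages:
            raise MemoryError("no free KV cache pages")
        page = self.free_pages.pop()
        self.ref_count[page] = 1
        return page

    def share(self, page: int) -> int:
        """Let another request reference an existing page."""
        self.ref_count[page] += 1
        return page

    def release(self, page: int) -> None:
        """Drop one reference; the page is freed when nobody uses it."""
        self.ref_count[page] -= 1
        if self.ref_count[page] == 0:
            del self.ref_count[page]
            self.free_pages.append(page)

    def writable(self, page: int) -> int:
        """Copy-on-write: return a page that is safe for one request to modify."""
        if self.ref_count[page] == 1:
            return page                    # sole owner, write in place
        self.ref_count[page] -= 1          # detach from the shared page
        new_page = self.allocate()
        # A real system would copy the page's K/V data into new_page here.
        return new_page


pool = SharedPagePool(num_pages=10)
prefix = [pool.allocate() for _ in range(4)]   # system prompt, stored once
tables = [prefix] + [[pool.share(p) for p in prefix] for _ in range(2)]
print(len(pool.ref_count))  # 4 physical pages serve 3 requests (12 without sharing)

# Request 1 appends into the last, partially filled prefix page.
tables[1][-1] = pool.writable(tables[1][-1])
print(len(pool.ref_count))  # 5: only the modified page was duplicated
```

Here three requests store the prefix in 4 pages instead of 12, a saving of (N-1)/N = 2/3 for N = 3.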
