Infrastructure

PagedAttention

A memory-management technique for the KV cache that borrows from operating-system virtual memory. Instead of allocating one contiguous block of GPU memory for each request's KV cache (which wastes memory through fragmentation), PagedAttention stores the cache in non-contiguous blocks ("pages") that are allocated on demand and can be shared between requests with common prefixes.

Why it matters

PagedAttention is the innovation behind vLLM and has since been adopted by most LLM serving frameworks. It increased serving throughput by 2–4x over naive implementations by eliminating memory waste from fragmentation. Without it, serving long-context models to many concurrent users would be dramatically more expensive.

Deep Dive

The problem PagedAttention solves: when a request arrives, you don't know how long the response will be, so you must pre-allocate KV cache for the maximum possible length. If max length is 4096 tokens but the response is only 200 tokens, 95% of the allocated memory is wasted. Multiply by hundreds of concurrent requests and GPU memory fills up fast, limiting throughput.
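The arithmetic above can be made concrete. This sketch uses hypothetical model dimensions (32 layers, 32 heads, head dimension 128, fp16) to show how much of a max-length reservation sits unused when the response is short:

```python
# Illustrative arithmetic with assumed model dimensions (not any specific model).

def kv_cache_bytes(n_tokens, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, per layer, per head, per token.
    return 2 * n_layers * n_heads * head_dim * dtype_bytes * n_tokens

max_len, actual_len = 4096, 200
allocated = kv_cache_bytes(max_len)   # reserved up front for the worst case
used = kv_cache_bytes(actual_len)     # what the 200-token response needs
waste = 1 - used / allocated

print(f"allocated: {allocated / 2**20:.0f} MiB")   # 2048 MiB
print(f"used:      {used / 2**20:.0f} MiB")        # 100 MiB
print(f"wasted:    {waste:.1%}")                   # ~95% of the reservation idles
```

With these assumed dimensions, each token costs 0.5 MiB of KV cache, so one request strands roughly 1.9 GiB; a few hundred concurrent requests would exhaust any GPU.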

The Virtual Memory Analogy

PagedAttention divides KV cache into fixed-size pages (e.g., 16 tokens per page). Pages are allocated only when needed and can be stored anywhere in GPU memory (non-contiguous). A page table maps logical positions to physical memory locations, just like OS virtual memory. This eliminates fragmentation: memory is allocated page-by-page as the response grows, and freed pages are immediately available for new requests.
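A minimal sketch of this bookkeeping, with Python lists standing in for GPU memory (the `PagedKVCache` class and its method names are illustrative, not vLLM's API):

```python
PAGE_SIZE = 16  # tokens per page, as in the example above

class PagedKVCache:
    def __init__(self, num_physical_pages):
        self.free_pages = list(range(num_physical_pages))
        self.page_tables = {}  # request_id -> [physical page ids]
        self.seq_lens = {}     # request_id -> tokens stored so far

    def append_token(self, request_id):
        # Allocate a new physical page only when the current one fills up.
        table = self.page_tables.setdefault(request_id, [])
        n = self.seq_lens.get(request_id, 0)
        if n % PAGE_SIZE == 0:  # first token, or current page is full
            if not self.free_pages:
                raise MemoryError("out of KV-cache pages")
            table.append(self.free_pages.pop())
        self.seq_lens[request_id] = n + 1

    def lookup(self, request_id, token_pos):
        # Logical position -> (physical page, offset): a page-table walk.
        table = self.page_tables[request_id]
        return table[token_pos // PAGE_SIZE], token_pos % PAGE_SIZE

    def free(self, request_id):
        # Freed pages return to the pool immediately, ready for new requests.
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)

# Usage: a 20-token response touches 2 pages, not a max-length reservation.
cache = PagedKVCache(num_physical_pages=64)
for _ in range(20):
    cache.append_token("req-1")
```

The key property is that memory grows with the actual response, one page at a time, and the page table hides the non-contiguous physical layout from the attention kernel.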

Prefix Sharing

A powerful extension: when multiple requests share the same prompt prefix (common with shared system prompts), their KV cache pages for that prefix can be physically shared — stored once in memory but referenced by all requests. This is copy-on-write semantics from OS design applied to LLM serving. For applications where many users share the same system prompt, this can reduce memory usage by 50%+ for the shared portion.
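The reference-counting side of this can be sketched as follows (the `SharedPrefixPool` class and its names are illustrative; real engines hash prefixes, handle partial pages, and trigger true copy-on-write when a shared page is modified):

```python
PAGE_SIZE = 16  # tokens per page

class SharedPrefixPool:
    def __init__(self, num_physical_pages):
        self.free_pages = list(range(num_physical_pages))
        self.ref_counts = {}    # physical page id -> number of requests using it
        self.prefix_index = {}  # prefix token ids (tuple) -> [physical page ids]

    def pages_for_prefix(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key in self.prefix_index:
            pages = self.prefix_index[key]  # hit: share the existing pages
        else:
            n_pages = -(-len(prefix_tokens) // PAGE_SIZE)  # ceil division
            pages = [self.free_pages.pop() for _ in range(n_pages)]
            self.prefix_index[key] = pages  # miss: allocate and register
        for p in pages:
            self.ref_counts[p] = self.ref_counts.get(p, 0) + 1
        return pages

    def release(self, prefix_tokens):
        # A page returns to the free pool only when its last user releases it.
        key = tuple(prefix_tokens)
        pages = self.prefix_index[key]
        for p in pages:
            self.ref_counts[p] -= 1
        if all(self.ref_counts[p] == 0 for p in pages):
            self.free_pages.extend(pages)
            del self.prefix_index[key]

# Two requests with the same 40-token system prompt share 3 pages instead
# of allocating 6 -- the 50%+ saving mentioned above, for the shared portion.
pool = SharedPrefixPool(num_physical_pages=8)
prompt = list(range(40))
a = pool.pages_for_prefix(prompt)
b = pool.pages_for_prefix(prompt)  # second request: no new allocation
```

The second lookup returns the same physical pages and only bumps the reference counts, which is exactly the copy-on-write bookkeeping the analogy describes.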
