
Attention

Also known as: Attention Mechanism, Self-Attention
The core mechanism in Transformers that lets a model weigh which parts of the input are most relevant to each other. Instead of reading text left to right like older models, attention lets each word "look at" every other word simultaneously to understand context.

Why It Matters

Attention is why modern LLMs understand that "bank" means different things in "river bank" vs. "bank account". It is also why longer context windows cost more: attention scales quadratically with sequence length.

Deep Dive

At its core, attention computes a weighted sum. For each token in a sequence, the mechanism asks: "How relevant is every other token to me right now?" It does this through three learned projections — queries, keys, and values (the Q, K, V you see in every paper). The query for one token is dot-producted against the keys of all tokens to produce a set of scores, those scores get softmaxed into weights, and the weights are used to blend the values into a context-aware representation. The entire operation is differentiable, so the model learns which relationships matter during training. Multi-head attention runs several of these in parallel with different projections, letting the model attend to different types of relationships simultaneously — one head might track syntax while another tracks coreference.
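The Q/K/V flow above fits in a few lines of numpy. This is a minimal single-head sketch, not any particular library's implementation; the division by the square root of the key dimension is the standard scaling used in practice, and the weight matrices here are random stand-ins for learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # learned projections
    scores = q @ k.T / np.sqrt(k.shape[-1])    # pairwise relevance scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v                         # blend values into context-aware vectors

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4): one context-aware vector per token
```

Multi-head attention simply runs several copies of this with different `w_q`, `w_k`, `w_v` matrices and concatenates the results.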

The Parallelism Breakthrough

The practical breakthrough of self-attention was parallelism. Recurrent networks like LSTMs processed tokens one at a time, which meant training was inherently sequential and slow. Attention processes the entire sequence in one shot, turning training into a massive matrix multiplication that GPUs devour. This is why Transformers could scale to billions of parameters and trillions of training tokens — the hardware was already built for exactly this kind of workload. Every major LLM you interact with today, from GPT-4 to Claude to Llama 3 to Mistral, owes its existence to this parallelism advantage.
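The dependency structure is the whole story here. A toy contrast, with a made-up recurrent update just for illustration: the recurrent loop must run its steps in order, while the attention-style score computation is a single matrix product with no step-to-step dependency.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 512, 64
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, d))

# Recurrent-style: each hidden state depends on the previous one,
# so the n steps are forced to run sequentially.
h = np.zeros(d)
for t in range(n):
    h = np.tanh(x[t] @ w + h)

# Attention-style: all n*n pairwise scores come from one matmul,
# exactly the workload GPUs are built to parallelize.
scores = (x @ w) @ x.T
print(scores.shape)  # (512, 512)
```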

The Quadratic Problem

The elephant in the room is quadratic scaling. Standard attention computes a score for every pair of tokens, so doubling your context window quadruples the computation and memory. A 4K context model uses 16 million attention scores per layer per head; jump to 128K and you are at 16 billion. This is why extending context windows has been such a massive engineering effort. Flash Attention (by Tri Dao) tackled the memory side by restructuring the computation to avoid materializing the full attention matrix in GPU HBM, making long contexts practical without changing the math. Grouped-query attention (GQA), used in Llama 2 and newer models, shares key-value heads across query heads to reduce the KV cache that builds up during generation.
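The quadratic growth is easy to check directly by counting pairwise scores (taking 4K as 4096 tokens, and so on):

```python
# Standard attention scores every token pair, so cost grows with n^2.
for n in (4_096, 32_768, 131_072):  # 4K, 32K, 128K token contexts
    print(f"{n:>7} tokens -> {n * n:>14,} scores per layer per head")
```

Doubling the context quadruples the score count; going from 4K to 128K multiplies it by 1024.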

Cross-attention is a variant worth understanding separately. In encoder-decoder models and in conditional generation (like text-to-image), the queries come from one sequence while the keys and values come from another. This is how Stable Diffusion conditions on your text prompt — the image-side queries attend to the text encoder outputs. It is also how the original Transformer handled translation: the decoder attended to the encoder outputs to decide what to generate next.
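Cross-attention is the same computation with one change: the queries are projected from one sequence while the keys and values come from another. A hedged sketch (random matrices standing in for learned weights, shapes chosen arbitrarily):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(x_target, x_source, w_q, w_k, w_v):
    """Queries from the target sequence; keys/values from the source."""
    q = x_target @ w_q
    k, v = x_source @ w_k, x_source @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v  # one output vector per *target* token

rng = np.random.default_rng(1)
d = 8
decoder_states = rng.normal(size=(3, d))  # e.g. 3 tokens being generated
encoder_states = rng.normal(size=(7, d))  # e.g. 7 encoded source tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(out.shape)  # (3, 8): shaped by the target, informed by the source
```

Swap "decoder states" for image-patch features and "encoder states" for text-encoder outputs and you have the conditioning pattern used in text-to-image models.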

What Attention Is Not

A common misconception is that attention is "understanding." It is not. Attention is a routing mechanism — it decides where information flows, but the actual processing happens in the feedforward layers that follow each attention block. Research like the "Transformer Circuits" work from Anthropic has shown that attention heads develop specialized roles (induction heads, previous-token heads), but these are learned patterns, not programmed logic. Another practical gotcha: attention does not inherently know token order. Without positional encodings (sinusoidal, learned, or rotary like RoPE), it treats a sequence as a bag of tokens. Getting the positional encoding right has turned out to be critical for long-context performance, which is why approaches like ALiBi and RoPE keep evolving.
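The "bag of tokens" point can be demonstrated directly: self-attention without positional encodings is permutation-equivariant, so shuffling the input tokens just shuffles the outputs the same way. A minimal check with random stand-in weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(2)
d, n = 6, 5
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
# Permuting the input tokens just permutes the outputs identically:
# attention alone carries no notion of position.
assert np.allclose(attn(x, w_q, w_k, w_v)[perm], attn(x[perm], w_q, w_k, w_v))
```

Positional encodings break this symmetry by injecting order information into the token representations (or, with RoPE, into the queries and keys themselves).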

Related Concepts
