
Attention

Also known as: Attention Mechanism, Self-Attention
The core mechanism in Transformers that lets a model weigh which parts of the input are most relevant to each other. Instead of reading text left-to-right like older models, attention lets every word "look at" every other word simultaneously to understand context.

Why it matters

Attention is why modern LLMs understand that "bank" means different things in "river bank" vs. "bank account." It's also why longer context windows cost more — attention scales quadratically with sequence length.

Deep Dive

At its core, attention computes a weighted sum. For each token in a sequence, the mechanism asks: "How relevant is every other token to me right now?" It does this through three learned projections — queries, keys, and values (the Q, K, V you see in every paper). The query for one token is dot-producted against the keys of all tokens to produce a set of scores; those scores are scaled by the square root of the key dimension (to keep the softmax from saturating), softmaxed into weights, and the weights are used to blend the values into a context-aware representation. The entire operation is differentiable, so the model learns which relationships matter during training. Multi-head attention runs several of these in parallel with different projections, letting the model attend to different types of relationships simultaneously — one head might track syntax while another tracks coreference.
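The weighted-sum description above can be sketched in a few lines of NumPy. This is a minimal single-head version; the random matrices stand in for the learned Q/K/V projections, and all names here are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores: how relevant each key is to each query, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # blend values by the weights

rng = np.random.default_rng(0)
seq_len, d = 5, 8
X = rng.normal(size=(seq_len, d))
# Random stand-ins for the learned projection matrices.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
```

Each row of `weights` is one token's attention distribution over the whole sequence; `out` is the resulting context-aware representation.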

The Parallelism Breakthrough

The practical breakthrough of self-attention was parallelism. Recurrent networks like LSTMs processed tokens one at a time, which meant training was inherently sequential and slow. Attention processes the entire sequence in one shot, turning training into a massive matrix multiplication that GPUs devour. This is why Transformers could scale to billions of parameters and trillions of training tokens — the hardware was already built for exactly this kind of workload. Every major LLM you interact with today, from GPT-4 to Claude to Llama 3 to Mistral, owes its existence to this parallelism advantage.

The Quadratic Problem

The elephant in the room is quadratic scaling. Standard attention computes a score for every pair of tokens, so doubling your context window quadruples the computation and memory. A 4K context model computes roughly 17 million attention scores per layer per head; jump to 128K and you are past 17 billion. This is why extending context windows has been such a massive engineering effort. Flash Attention (by Tri Dao) tackled the memory side by restructuring the computation to avoid materializing the full attention matrix in GPU HBM, making long contexts practical without changing the math. Grouped-query attention (GQA), used in Llama 2 and newer models, shares key-value heads across query heads to reduce the KV cache that builds up during generation.
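The arithmetic of quadratic scaling, and the KV-head sharing idea behind GQA, can both be sketched briefly. The `gqa_scores` function and its shapes are an illustrative simplification, not the implementation used in Llama 2.

```python
import numpy as np

# Quadratic scaling: one score per token pair, per layer, per head.
scores_4k = 4096 ** 2        # 16,777,216: roughly 17 million
scores_128k = 131072 ** 2    # roughly 17.2 billion
# A 32x longer context costs 1024x more scores.

# GQA sketch: n_q query heads share n_kv key/value heads (n_kv < n_q).
def gqa_scores(Q, K):
    # Q: (n_q, seq, d); K: (n_kv, seq, d). Repeat each cached KV head
    # so one KV head serves a whole group of query heads.
    group = Q.shape[0] // K.shape[0]
    K_full = np.repeat(K, group, axis=0)
    return Q @ K_full.transpose(0, 2, 1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16, 32))  # 8 query heads
K = rng.normal(size=(2, 16, 32))  # only 2 KV heads cached: 4x smaller KV cache
scores = gqa_scores(Q, K)
```

The compute per step is unchanged; the win is that only 2 key/value heads, not 8, have to be stored in the KV cache during generation.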

Cross-attention is a variant worth understanding separately. In encoder-decoder models and in conditional generation (like text-to-image), the queries come from one sequence while the keys and values come from another. This is how Stable Diffusion conditions on your text prompt — the image-side queries attend to the text encoder outputs. It is also how the original Transformer handled translation: the decoder attended to the encoder outputs to decide what to generate next.
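The only structural change in cross-attention is where Q versus K/V come from, which a short sketch makes concrete. The sequence names below (decoder/encoder) are illustrative; the random matrices again stand in for learned projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(Xq, Xkv, Wq, Wk, Wv):
    # Queries come from one sequence; keys and values from another.
    Q, K, V = Xq @ Wq, Xkv @ Wk, Xkv @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V  # each query token blends the *other* sequence's values

rng = np.random.default_rng(1)
d = 16
decoder_tokens = rng.normal(size=(7, d))   # e.g. image-side latents
encoder_tokens = rng.normal(size=(12, d))  # e.g. text-prompt embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(decoder_tokens, encoder_tokens, Wq, Wk, Wv)
```

Note the output has the query sequence's length (7 tokens here) but is built entirely from the other sequence's values, which is exactly how a prompt conditions generation.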

What Attention Is Not

A common misconception is that attention is "understanding." It is not. Attention is a routing mechanism — it decides where information flows, but the actual processing happens in the feedforward layers that follow each attention block. Research like the "Transformer Circuits" work from Anthropic has shown that attention heads develop specialized roles (induction heads, previous-token heads), but these are learned patterns, not programmed logic. Another practical gotcha: attention does not inherently know token order. Without positional encodings (sinusoidal, learned, or rotary like RoPE), it treats a sequence as a bag of tokens. Getting the positional encoding right has turned out to be critical for long-context performance, which is why approaches like ALiBi and RoPE keep evolving.
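The "bag of tokens" point above is easy to verify: without positional encodings, self-attention is permutation-equivariant, so shuffling the input tokens just shuffles the outputs the same way. A minimal demonstration, reusing the same single-head setup as before:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(2)
d = 8
X = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(6)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)
# Permuting the tokens permutes the outputs identically:
# token order carries no signal without positional encodings.
assert np.allclose(out[perm], out_perm)
```

Positional encodings break this symmetry by injecting order information into the token representations (or, for RoPE, directly into the query/key dot products).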
