
Self-Attention

Scaled Dot-Product Attention
An attention mechanism in which a sequence attends to itself: each token computes its relevance to every other token in the same sequence. The queries, keys, and values all come from the same input. This lets each token gather information from all other tokens, weighted by relevance. Self-attention is the core operation of every Transformer layer.

Why It Matters

Self-attention is what makes the Transformer work. It replaces the sequential processing of an RNN with parallel, direct connections between all positions. The "bank" in "river bank" can attend to "river" to resolve its meaning, no matter how far apart the two words are. This ability to connect any two positions directly is why Transformers handle long-range dependencies so well.

Deep Dive

The computation: for input X, compute Q = X·W_Q, K = X·W_K, V = X·W_V. Then: Attention(Q,K,V) = softmax(Q·K^T / √d_k) · V. The softmax over Q·K^T / √d_k produces an N×N attention matrix where entry (i,j) represents how much token i attends to token j. The √d_k scaling prevents dot products from growing too large in high dimensions, which would push softmax into saturated regions with near-zero gradients.
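
A minimal sketch of this computation in NumPy, assuming small illustrative dimensions; the weight names W_Q, W_K, W_V mirror the formulas above, and nothing here is a specific library's API:

    import numpy as np

    def self_attention(X, W_Q, W_K, W_V):
        """Scaled dot-product self-attention over a sequence X of shape (N, d_model)."""
        Q = X @ W_Q                       # (N, d_k) queries
        K = X @ W_K                       # (N, d_k) keys
        V = X @ W_V                       # (N, d_v) values
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # (N, N) scaled dot products
        # Row-wise softmax: entry (i, j) = how much token i attends to token j
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                # each row is a relevance-weighted sum of values

    # Illustrative sizes: 4 tokens, model width 8, head width 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    W_Q, W_K, W_V = [rng.normal(size=(8, 8)) for _ in range(3)]
    out = self_attention(X, W_Q, W_K, W_V)   # shape (4, 8)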

Causal vs. Bidirectional

In decoder-only LLMs (GPT, Claude, Llama), self-attention is causal: each token can attend only to itself and earlier tokens. This is enforced by a causal mask that sets future positions to −∞ before softmax. In encoder models (BERT), self-attention is bidirectional: every token attends to every other token. The causal constraint is what makes autoregressive generation possible — the model can't "peek" at future tokens.
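
A sketch of how the mask works, reusing the NumPy setup above (the helper name is hypothetical): scores above the diagonal, i.e. attention to future tokens, are set to −∞ before the softmax so they receive zero weight:

    import numpy as np

    def causal_self_attention(Q, K, V):
        N, d_k = Q.shape
        scores = Q @ K.T / np.sqrt(d_k)                     # (N, N) raw scores
        future = np.triu(np.ones((N, N), dtype=bool), k=1)  # True above the diagonal = future positions
        scores = np.where(future, -np.inf, scores)          # mask: token i cannot attend to j > i
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # masked positions get exactly zero weight
        return weights @ V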

The Quadratic Cost

Self-attention computes an N×N attention matrix, making it O(N²) in both time and memory. For a 128K token context, that's ~16 billion entries per layer per head. This quadratic scaling is the fundamental limitation that drives research into sparse attention, linear attention, Flash Attention (which reduces memory but not compute), and SSMs (which avoid the N×N matrix entirely). Every approach to long-context modeling is ultimately about managing this quadratic cost.
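
A back-of-the-envelope illustration of that number, assuming one fp16 score (2 bytes) per attention-matrix entry:

    N = 128_000                                   # ~128K token context
    entries = N * N                               # N×N attention matrix per layer per head
    print(f"{entries:,} entries")                 # 16,384,000,000 (~16 billion)
    print(f"{entries * 2 / 1e9:.0f} GB in fp16")  # ~33 GB per layer per head, before any optimization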
