Basics

Multi-Head Attention

MHA
Runs multiple attention operations in parallel, each with its own learned projections for queries, keys, and values. Instead of a single attention function over the full model dimension, multi-head attention splits the dimension into multiple "heads" (e.g., a 4096-dim model with 32 heads, each 128-dim). Each head can attend to a different kind of relationship simultaneously.
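
A tiny shape-only sketch of the split (tensors here are illustrative, not a full attention layer; in a real layer the split happens after the Q/K/V projections):

```python
import torch

model_dim, num_heads = 4096, 32
head_dim = model_dim // num_heads  # 128

x = torch.randn(1, 10, model_dim)  # (batch, seq, model_dim)
# Carve the model dimension into 32 independent 128-dim head slices
heads = x.view(1, 10, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([1, 32, 10, 128])
```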

Why It Matters

Multi-head attention is a big part of what makes the Transformer so expressive. One head might track syntactic relations (subject-verb), another positional patterns (neighboring words), another semantic similarity. This parallel specialization lets the model capture many kinds of dependencies at once, something a single attention head cannot do as well.

Deep Dive

The mechanism: for each head i, the model learns separate projection matrices W_Q^i, W_K^i, W_V^i that project the input into a lower-dimensional space (head_dim = model_dim / num_heads). Each head independently computes attention: softmax(Q_i · K_i^T / √d_k) · V_i, where d_k is the head dimension. The outputs of all heads are concatenated and projected back to the full model dimension through a final linear layer W_O.
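
A minimal runnable sketch of this mechanism in PyTorch (the class and variable names are illustrative, not from any particular library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, model_dim: int, num_heads: int):
        super().__init__()
        assert model_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = model_dim // num_heads
        # One fused projection per role; equivalent to per-head W_Q^i, W_K^i, W_V^i
        self.w_q = nn.Linear(model_dim, model_dim)
        self.w_k = nn.Linear(model_dim, model_dim)
        self.w_v = nn.Linear(model_dim, model_dim)
        self.w_o = nn.Linear(model_dim, model_dim)  # final output projection W_O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, model_dim = x.shape

        # Project, then split: (batch, seq, model_dim) -> (batch, heads, seq, head_dim)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention per head: softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        out = F.softmax(scores, dim=-1) @ v

        # Concatenate heads and project back to model_dim
        out = out.transpose(1, 2).reshape(batch, seq_len, model_dim)
        return self.w_o(out)

mha = MultiHeadAttention(model_dim=64, num_heads=4)
x = torch.randn(2, 10, 64)
print(mha(x).shape)  # torch.Size([2, 10, 64])
```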

Head Specialization

Research shows that different heads learn different functions. Some heads attend to the previous token (positional). Some attend to syntactically related tokens (e.g., a subject attending to its verb). Some implement "induction" (pattern completion). Some attend broadly (gathering global context). Not all heads are equally important: pruning 20–40% of heads often has minimal impact on performance, suggesting significant redundancy. A sketch of what pruning amounts to follows below.
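
In spirit, pruning a head means zeroing its output before concatenation and the W_O projection; real methods pick heads by an importance score, while the indices here are purely illustrative:

```python
import torch

batch, num_heads, seq_len, head_dim = 1, 8, 10, 16
head_outputs = torch.randn(batch, num_heads, seq_len, head_dim)

# Drop heads 2 and 5 (arbitrary choices for illustration)
keep = torch.ones(num_heads)
keep[[2, 5]] = 0.0
pruned = head_outputs * keep.view(1, num_heads, 1, 1)
# Concatenation and W_O proceed as normal; pruned heads contribute zeros
```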

GQA and MQA

Multi-Query Attention (MQA) uses a single key-value head shared across all query heads, shrinking the KV cache by a factor equal to the number of query heads. Grouped-Query Attention (GQA) is a middle ground: groups of query heads share one key-value head (e.g., 32 query heads with 8 KV heads). GQA preserves most of MHA's quality while dramatically reducing KV cache memory. Llama 2 70B, Mistral, and most modern LLMs use GQA.
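
A minimal sketch of the GQA computation under the shapes from the example above (all names illustrative); a common implementation repeats each KV head across its group of query heads:

```python
import torch
import torch.nn.functional as F

batch, seq_len = 1, 10
num_q_heads, num_kv_heads, head_dim = 32, 8, 128
group_size = num_q_heads // num_kv_heads  # 4 query heads per KV head

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)  # KV cache holds only 8 heads
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Broadcast each KV head to its group of query heads
k = k.repeat_interleave(group_size, dim=1)  # -> (1, 32, 10, 128)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / (head_dim ** 0.5)
out = F.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 32, 10, 128])
# MQA is the num_kv_heads == 1 case; full MHA is num_kv_heads == num_q_heads
```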
