Basics

GQA

Grouped Query Attention
An attention variant in which multiple query heads share a single key-value head, shrinking the KV cache without significantly reducing quality. Instead of every query head having its own K and V projections (standard MHA), groups of query heads share one K and one V projection. Llama 2 70B, Mistral, Gemma, and most modern LLMs use GQA.

Why It Matters

GQA is the practical fix for the KV cache memory problem. Standard multi-head attention with 64 heads must cache 64 sets of K and V tensors per layer; GQA with 8 KV heads cuts that to 8 sets, an 8x memory reduction. That directly means the same hardware can serve more concurrent users or handle longer contexts.

Deep Dive

The spectrum: Multi-Head Attention (MHA) has equal numbers of Q, K, V heads — maximum quality, maximum memory. Multi-Query Attention (MQA) has many Q heads but only one K and one V head — minimum memory, some quality loss. GQA is the middle ground: divide Q heads into groups, each group sharing one K and one V head. A model with 32 Q heads and 8 KV groups has each KV head serving 4 Q heads.
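To make the grouping concrete, here is a minimal PyTorch sketch using the illustrative dimensions above (32 Q heads, 8 KV heads); all shapes are assumptions, not tied to any particular model. Each cached KV head is repeated across its group of 4 Q heads before ordinary scaled dot-product attention.

```python
import torch
import torch.nn.functional as F

batch, seq_len = 2, 16
n_q_heads, n_kv_heads, head_dim = 32, 8, 128
group_size = n_q_heads // n_kv_heads  # 4 Q heads share each KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # only 8 KV heads are cached
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V along the head axis so each group of Q heads attends
# to its shared KV head.
k = k.repeat_interleave(group_size, dim=1)  # (batch, 32, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v)  # (batch, 32, seq_len, head_dim)
```

Note that the expansion is only a view-level convenience for computing attention; the memory win comes from storing just the 8 KV heads in the cache.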

Quality vs. Memory

Research shows that GQA with 8 KV heads matches MHA quality for most tasks while using 4–8x less KV cache memory. The quality preservation is somewhat surprising: it suggests that many attention heads are learning similar key-value patterns, so sharing them is efficient rather than limiting. Converting an existing MHA model to GQA through "uptraining" (a short fine-tuning phase) is also effective, avoiding the need to retrain from scratch.
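A hedged sketch of that conversion step: the GQA paper initializes each shared K/V projection by mean-pooling the projection matrices of the heads in its group, then uptrains briefly. The shapes below are illustrative assumptions.

```python
import torch

n_q_heads, n_kv_heads, head_dim, d_model = 32, 8, 128, 4096
group_size = n_q_heads // n_kv_heads

# Stand-in MHA checkpoint: one (head_dim, d_model) K-projection block per Q head.
w_k_mha = torch.randn(n_q_heads, head_dim, d_model)

# Mean-pool each group of 4 per-head projections into one shared KV projection;
# the same pooling applies to the V projections, followed by uptraining.
w_k_gqa = w_k_mha.view(n_kv_heads, group_size, head_dim, d_model).mean(dim=1)
print(w_k_gqa.shape)  # torch.Size([8, 128, 4096])
```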

Impact on Inference

The KV cache memory savings from GQA directly translate to: longer context windows on the same GPU, more concurrent requests (higher throughput), and faster attention computation (fewer K and V tensors to read). For a 70B model at 128K context, the difference between MHA and GQA can be hundreds of gigabytes of KV cache — the difference between needing 8 GPUs and needing 4.
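A back-of-the-envelope check of that claim, assuming Llama-2-70B-like dimensions (80 layers, head_dim 128, fp16) at 128K context:

```python
# Assumed dimensions; fp16 = 2 bytes per value.
layers, head_dim, ctx_len, dtype_bytes = 80, 128, 128 * 1024, 2

def kv_cache_gib(n_kv_heads: int) -> float:
    # Factor 2 covers the K and V tensors cached per layer, head, and token.
    return 2 * layers * n_kv_heads * head_dim * ctx_len * dtype_bytes / 2**30

print(f"MHA, 64 KV heads: {kv_cache_gib(64):.0f} GiB")  # ~320 GiB
print(f"GQA,  8 KV heads: {kv_cache_gib(8):.0f} GiB")   # ~40 GiB
```

The roughly 280 GiB gap is consistent with the "hundreds of gigabytes" figure above.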
