Core Concepts

GQA

Grouped Query Attention
An attention variant in which multiple query heads share a single key-value head, reducing KV cache size without significantly reducing quality. Instead of each query head having its own K and V projections (as in standard MHA), groups of query heads share K and V projections. Llama 2 70B, Mistral, Gemma, and most modern LLMs use GQA.
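A minimal NumPy sketch of the sharing pattern (hypothetical shapes, no learned projections): only the few K/V heads are cached, and each is broadcast to its group of query heads at attention time.

```python
import numpy as np

# Illustrative sizes: 32 query heads share 8 key/value heads in groups of 4.
num_q_heads, num_kv_heads, seq, d_head = 32, 8, 5, 16
group = num_q_heads // num_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_q_heads, seq, d_head))
k = rng.standard_normal((num_kv_heads, seq, d_head))  # only 8 K heads cached
v = rng.standard_normal((num_kv_heads, seq, d_head))  # only 8 V heads cached

# Broadcast each KV head to its group of query heads.
k_rep = np.repeat(k, group, axis=0)  # (32, seq, d_head)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d_head)  # (32, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                # softmax over keys
out = weights @ v_rep                                    # (32, seq, d_head)
print(out.shape)  # (32, 5, 16)
```

The cache holds only the 8 K and 8 V tensors; the repeat is a cheap view-style expansion done per step, which is where the memory saving comes from.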

Why It Matters

GQA is a practical solution to the KV cache memory problem. Standard multi-head attention with 64 heads must cache 64 sets of K and V tensors per layer. GQA with 8 KV heads reduces this to 8 sets, an 8x memory reduction. That translates directly into serving more concurrent users on the same hardware, or handling longer contexts.

Deep Dive

The spectrum: Multi-Head Attention (MHA) has equal numbers of Q, K, V heads — maximum quality, maximum memory. Multi-Query Attention (MQA) has many Q heads but only one K and one V head — minimum memory, some quality loss. GQA is the middle ground: divide Q heads into groups, each group sharing one K and one V head. A model with 32 Q heads and 8 KV groups has each KV head serving 4 Q heads.
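The whole spectrum can be expressed as a single knob, the number of KV heads; the head counts below are illustrative:

```python
def kv_cache_fraction(num_q_heads: int, num_kv_heads: int) -> float:
    """KV cache size relative to MHA with the same number of query heads."""
    assert num_q_heads % num_kv_heads == 0  # heads must divide into equal groups
    return num_kv_heads / num_q_heads

# 32 query heads throughout; the variants differ only in KV heads.
print(kv_cache_fraction(32, 32))  # MHA: 1.0, full-size cache
print(kv_cache_fraction(32, 1))   # MQA: 0.03125, 1/32 of the cache
print(kv_cache_fraction(32, 8))   # GQA: 0.25, each KV head serves 4 Q heads
```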

Quality vs. Memory

Research shows that GQA with 8 KV heads matches MHA quality for most tasks while using 4–8x less KV cache memory. The quality preservation is somewhat surprising: it suggests that many attention heads are learning similar key-value patterns, so sharing them is efficient rather than limiting. Converting an existing MHA model to GQA through "uptraining" (a short fine-tuning phase) is also effective, avoiding the need to retrain from scratch.
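A hedged sketch of the uptraining initialization, assuming the mean-pooling scheme from the GQA paper: each shared KV projection starts as the average of the per-head projections it replaces (shapes here are made up, and the weights are random stand-ins for a pretrained checkpoint).

```python
import numpy as np

# Illustrative sizes: convert 32 MHA KV heads into 8 shared GQA KV heads.
num_heads, num_kv_heads, d_model, d_head = 32, 8, 64, 16
group = num_heads // num_kv_heads  # 4 original heads per shared head

rng = np.random.default_rng(0)
# Per-head K projection matrices from a pretrained MHA checkpoint (random here).
w_k_mha = rng.standard_normal((num_heads, d_model, d_head))

# Mean-pool each group of 4 K heads into one shared head; the V projections
# get the same treatment, then the model is briefly fine-tuned ("uptrained").
w_k_gqa = w_k_mha.reshape(num_kv_heads, group, d_model, d_head).mean(axis=1)
print(w_k_gqa.shape)  # (8, 64, 16)
```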

Impact on Inference

The KV cache memory savings from GQA directly translate to: longer context windows on the same GPU, more concurrent requests (higher throughput), and faster attention computation (fewer K and V tensors to read). For a 70B model at 128K context, the difference between MHA and GQA can be hundreds of gigabytes of KV cache — the difference between needing 8 GPUs and needing 4.
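A back-of-envelope sketch of those cache sizes, using Llama-2-70B-like dimensions (80 layers, head dimension 128, 64 query heads, fp16) at a 128K context:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    # 2 tensors (K and V) per KV head, per layer, per cached token; fp16 default.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

ctx = 128 * 1024
mha = kv_cache_bytes(80, 64, 128, ctx)  # 64 KV heads (full MHA)
gqa = kv_cache_bytes(80, 8, 128, ctx)   # 8 KV heads (GQA)
GiB = 1024 ** 3
print(mha / GiB, gqa / GiB)  # 320.0 40.0
```

The 8x gap between 320 GiB and 40 GiB of cache, on top of the model weights, is what decides whether a deployment fits on half as many GPUs.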
