
Sparse Attention

Local Attention, Sliding Window Attention
An attention mechanism that computes only a subset of token pairs instead of the full N×N attention matrix. Sliding window attention attends only to nearby tokens (within a fixed window). Sparse patterns (such as Longformer's local + global combination) let designated tokens attend to everything while most tokens attend locally. These methods reduce attention's quadratic cost on long sequences.
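
To make the cost reduction concrete, here is a back-of-the-envelope comparison of attention-score computations. This is a minimal sketch: the sequence length N = 32,768 is an arbitrary illustration, and the window W = 4,096 matches the Mistral-7B example in the Deep Dive below.

```python
# Dense attention computes a score for every (query, key) pair: N * N.
# Sliding window attention computes at most W scores per query: N * W.
N = 32_768  # sequence length (illustrative)
W = 4_096   # sliding window size (as in Mistral-7B)

dense_pairs = N * N
sparse_pairs = N * W

print(f"dense:  {dense_pairs:,} score computations")   # 1,073,741,824
print(f"sparse: {sparse_pairs:,} score computations")  # 134,217,728
print(f"reduction: {dense_pairs // sparse_pairs}x")    # N / W = 8x here
```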

Why It Matters

Sparse attention is how efficient models such as Mistral-7B handle long sequences without paying the full cost of dense attention. It is a practical compromise between "attend to everything" (expensive but thorough) and "attend to nothing distant" (cheap but limited). Understanding sparse attention helps you evaluate claims about context length and anticipate where quality degradation is likely to occur.

Deep Dive

Sliding window attention: each token attends only to tokens within a fixed window (e.g., 4096 tokens). Information from earlier tokens still propagates through the layers: layer 1 sees 4096 tokens, layer 2 effectively sees 8192 (two windows' worth), and the receptive field grows by one window per layer, so information from the full sequence has a chance to reach the final layer. Mistral-7B uses a 4096-token sliding window across its 32 layers, giving a theoretical attention span of roughly 32 × 4096 ≈ 131K tokens.
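
A minimal sketch of the causal sliding-window mask described above, using PyTorch. The function name and toy sizes are illustrative; in a real model, the mask is applied by setting disallowed scores to -inf before the softmax.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: entry (i, j) is True if query i may attend to key j.
    Causal sliding window: token i sees tokens j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)

# Toy example: 8 tokens, window of 3. Each row has at most 3 True entries.
mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
```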

Hybrid Patterns

Longformer combines sliding window (local) attention with global attention on selected tokens (like [CLS] or user-defined positions). BigBird adds random attention connections on top of local and global patterns. These hybrid approaches let models handle 4K–16K tokens with subquadratic cost while maintaining the ability to connect distant tokens through global positions.
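
A minimal sketch of a Longformer-style local + global pattern. It assumes a symmetric (encoder-style) local window; the function name, global positions, and toy sizes are illustrative rather than Longformer's actual API.

```python
import torch

def local_global_mask(seq_len: int, window: int, global_idx: list[int]) -> torch.Tensor:
    """Longformer-style pattern: a symmetric local window plus a few global
    positions that attend to, and are attended by, every token."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = (j - i).abs() <= window // 2  # local: |i - j| within half a window
    g = torch.tensor(global_idx)
    mask[g, :] = True                    # global tokens attend to everything
    mask[:, g] = True                    # every token attends to global tokens
    return mask

# Toy example: position 0 acts like a [CLS]-style global token.
mask = local_global_mask(seq_len=10, window=4, global_idx=[0])
print(mask.int())
```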

The Quality Trade-off

Sparse attention works well for many tasks but can degrade on tasks requiring precise long-range dependencies — referencing a specific fact from the beginning of a long document, maintaining consistency across a long conversation, or following complex instructions that span many tokens. Dense attention (full quadratic) with Flash Attention remains more robust for these cases, which is why most frontier models still use dense attention and rely on Flash Attention for efficiency rather than sparsity.
