Sparse Attention: Definition and Meaning - AI Wiki

Sparse attention refers to attention mechanisms that process only a subset of token pairs instead of the full N×N attention matrix. Sliding window attention attends only to nearby tokens (those within a fixed window). Sparse patterns (such as Longformer's combination of local and global attention) let specific tokens attend to everything while most tokens attend only locally. These approaches reduce the quadratic cost of attention on long sequences.
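To make the cost difference concrete, here is a minimal sketch that counts how many (query, key) pairs get scored under dense attention versus a symmetric sliding window. The function names and the `radius` parameter (how many neighbors on each side a token may attend to) are assumptions chosen for illustration, not part of any library.

```python
def dense_pairs(n: int) -> int:
    """Pairs scored by full dense attention over n tokens: the whole N x N matrix."""
    return n * n


def window_pairs(n: int, radius: int) -> int:
    """Pairs scored when token i attends only to tokens j with |i - j| <= radius."""
    # Each token sees itself, up to `radius` tokens to the left,
    # and up to `radius` tokens to the right (clipped at the sequence edges).
    return sum(min(i, radius) + min(n - 1 - i, radius) + 1 for i in range(n))


if __name__ == "__main__":
    n, radius = 32_768, 2_048
    ratio = window_pairs(n, radius) / dense_pairs(n)
    print(f"windowed attention scores {ratio:.1%} of the dense pairs")
```

The dense count grows with the square of the sequence length, while the windowed count grows roughly as n × (2 × radius + 1), which is linear in n for a fixed window.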

Why It Matters

Sparse attention is how Mistral, Mixtral, and other efficient models handle long sequences without paying the cost of full dense attention. It is the practical compromise between "attend to everything" (expensive but thorough) and "attend to nothing far away" (cheap but limited). Understanding sparse attention helps you evaluate claims about context length and anticipate where quality degradation may occur.

Deep Dive

Sliding window attention: each token attends only to tokens within a fixed window (for example, 4096 tokens). Information from earlier tokens propagates through the layers: layer 1 sees 4096 tokens, layer 2 effectively sees about 8192 (two windows), and by the final layer information from much of the sequence has had a chance to propagate. Mistral-7B uses a 4096-token sliding window across its 32 layers, which gives a theoretical reach of roughly 32 × 4096 ≈ 131K tokens.
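The layer-by-layer propagation described above can be sketched with a small boolean mask. This is an illustrative NumPy example under stated assumptions, not Mistral's implementation: the function names are hypothetical, the window is causal and counts the token itself, and "reach" here means only that a path for information exists, not that the model actually uses it.

```python
import numpy as np


def sliding_window_causal_mask(n: int, window: int) -> np.ndarray:
    """mask[i, j] is True when query i may attend to key j.

    Token i attends to itself and the `window - 1` tokens before it.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)


def receptive_field(mask: np.ndarray, layers: int) -> np.ndarray:
    """reach[i, j] is True when position j can influence position i after `layers` layers."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(layers):
        # Position i can reach j if it attends to some k that already reaches j.
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    return reach


if __name__ == "__main__":
    mask = sliding_window_causal_mask(n=10, window=3)
    for layers in (1, 2, 4):
        seen = int(receptive_field(mask, layers)[-1].sum())
        print(f"after {layers} layer(s) the last token can draw on {seen} positions")
```

Because the window includes the token itself, each layer extends the reach by `window - 1` positions; for a window of 4096 that is close to one full window per layer, which matches the "two windows after two layers" intuition.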

Hybrid Patterns

Longformer combines sliding window (local) attention with global attention on selected tokens (such as [CLS] or user-defined positions). BigBird adds random attention connections on top of the local and global patterns. These hybrid approaches let models handle 4K–16K tokens at subquadratic cost while retaining the ability to connect distant tokens through the global positions.
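A hybrid pattern of this kind can be pictured as a single boolean mask built from three ingredients. The sketch below is a simplified illustration in NumPy, not the Longformer or BigBird implementation (both use blocked, memory-efficient kernels rather than a dense mask); `hybrid_mask` and all of its parameters are hypothetical names.

```python
import numpy as np


def hybrid_mask(n, radius, global_positions, num_random=0, seed=0):
    """Build an n x n mask where mask[i, j] is True if token i may attend to token j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]

    # Local component: every token attends to neighbors within `radius`.
    mask = np.abs(i - j) <= radius

    # Global component: chosen tokens attend everywhere and are attended to by everyone.
    g = np.asarray(list(global_positions), dtype=int)
    mask[g, :] = True
    mask[:, g] = True

    # Random component (BigBird-style): each token gets a few extra random keys.
    if num_random > 0:
        rng = np.random.default_rng(seed)
        for row in range(n):
            cols = rng.choice(n, size=min(num_random, n), replace=False)
            mask[row, cols] = True
    return mask


if __name__ == "__main__":
    m = hybrid_mask(n=8, radius=1, global_positions=[0])
    print(m.astype(int))
    print("fraction of pairs kept:", m.mean())
```

With a single global position, any two tokens are at most two attention hops apart (through the global token), which is how these patterns keep distant tokens connected while scoring far fewer pairs than dense attention.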

Quality Trade-off

Sparse attention works well for many tasks but can degrade on tasks that require precise long-range dependencies: referencing a specific fact from the beginning of a long document, maintaining consistency over a long conversation, or following complex instructions that span many tokens. Dense (full quadratic) attention combined with Flash Attention, which speeds up exact dense attention without approximating it, remains more robust in these cases. This is why most frontier models still use dense attention and rely on Flash Attention for efficiency rather than on sparsity.
