Cross-Attention: Definition and Meaning - AI Wiki

An attention mechanism in which the queries come from one sequence and the keys and values come from a different sequence. In encoder-decoder models, the decoder's queries attend to the encoder's keys and values, which lets the decoder "look at" the input while it generates output. Cross-attention conditions image generation in text-to-image diffusion models in the same way: the image generation process attends to the text prompt.

Why it matters

Cross-attention is the bridge between different modalities and between different parts of an architecture. It is how translation models connect source and target languages, how image generators follow text prompts, how multimodal models relate images to text, and how some retrieval-augmented systems incorporate retrieved documents. Whenever two distinct inputs need to interact, cross-attention is usually involved.

Deep dive

In self-attention, Q, K, and V all come from the same sequence: each token attends to the other tokens in that same input. In cross-attention, Q comes from one source (such as the decoder) and K and V come from another (such as the encoder). The decoder token asks, "What in the input is relevant to what I am generating right now?" and the attention mechanism answers with a weighted summary of the input.
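The Q/K/V split described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any particular library's implementation: the function name cross_attention, the projection matrices w_q, w_k, w_v, and all sizes are assumptions chosen for the example, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative sketch).

    Queries are projected from the decoder states; keys and values are
    projected from the encoder states.
    """
    q = decoder_states @ w_q          # (T_dec, d_k)
    k = encoder_states @ w_k          # (T_enc, d_k)
    v = encoder_states @ w_v          # (T_enc, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)        # each decoder row sums to 1
    return weights @ v, weights               # weighted summary of the input

# Hypothetical sizes: 3 decoder tokens attending over 5 encoder tokens.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
decoder_states = rng.normal(size=(3, d_model))
encoder_states = rng.normal(size=(5, d_model))
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

output, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Passing the same array as both decoder_states and encoder_states turns this into self-attention, which is the only structural difference between the two.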

In Diffusion Models

Text-to-image models use cross-attention to condition image generation on text. The text prompt is encoded into embeddings (via CLIP or T5), and at each denoising step the image features attend to those text embeddings through cross-attention layers. This is how the model knows to generate "a cat on a surfboard": each spatial location in the image attends to the relevant words. Techniques such as prompt weighting and attention editing work by manipulating these cross-attention maps.
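As a rough illustration of that conditioning step, the sketch below treats a small grid of image features as queries and a handful of text embeddings as keys and values. Everything here is an assumption made for the example: the sizes, the random stand-ins for text-encoder output, the projection matrices, and especially the token_scale reweighting, which is one simple way to emphasize a token and not how any specific tool implements prompt weighting.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4x4 grid of image features and 6 prompt tokens.
height, width, channels = 4, 4, 16
num_text_tokens, text_dim, d_k = 6, 32, 8

# One row per spatial location; random stand-in for real image features.
image_features = rng.normal(size=(height * width, channels))
# Random stand-in for the output of a text encoder.
text_embeddings = rng.normal(size=(num_text_tokens, text_dim))

w_q = rng.normal(size=(channels, d_k))       # queries come from the image
w_k = rng.normal(size=(text_dim, d_k))       # keys come from the text
w_v = rng.normal(size=(text_dim, channels))  # values come from the text

q = image_features @ w_q
k = text_embeddings @ w_k
v = text_embeddings @ w_v

# attn[i, t]: how strongly spatial location i attends to prompt token t.
attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # (16, 6)
conditioned = image_features + attn @ v           # residual update

# One spatial attention map per prompt token.
token_maps = attn.T.reshape(num_text_tokens, height, width)

# Illustrative reweighting: double the attention paid to token 2,
# then renormalize so each location's weights still sum to 1.
token_scale = np.ones(num_text_tokens)
token_scale[2] = 2.0
reweighted = attn * token_scale
reweighted = reweighted / reweighted.sum(axis=-1, keepdims=True)

print(token_maps.shape, conditioned.shape)
```

Reading token_maps[t] as a small heatmap shows which image regions respond to prompt token t, which is the quantity attention-editing methods inspect or modify.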

Attention Patterns

Self-attention and cross-attention have different computational profiles. Self-attention is quadratic in sequence length, because every token attends to every other token. Cross-attention scales with the product of decoder length and encoder length, because each decoder token attends to all encoder tokens; it is linear in each length taken separately. When the encoder output is much shorter than the decoder sequence, as with a short text prompt conditioning a large grid of image features, cross-attention is cheaper than decoder self-attention. With a long encoder input and a short output, the reverse can hold.
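The scaling difference is easy to check by counting query-key score entries. The helper names and the sequence lengths below are illustrative assumptions, and the counts ignore heads, projections, and constant factors.

```python
def self_attention_scores(seq_len):
    # Every token is scored against every token in the same sequence.
    return seq_len * seq_len

def cross_attention_scores(decoder_len, encoder_len):
    # Every decoder token is scored against every encoder token.
    return decoder_len * encoder_len

# Illustrative case 1: many image tokens attending to a short text prompt.
image_tokens, prompt_tokens = 4096, 77
image_self = self_attention_scores(image_tokens)
image_cross = cross_attention_scores(image_tokens, prompt_tokens)
print(image_self, image_cross)   # 16777216 315392

# Illustrative case 2: a short output generated from a long input.
output_tokens, input_tokens = 20, 500
short_self = self_attention_scores(output_tokens)
short_cross = cross_attention_scores(output_tokens, input_tokens)
print(short_self, short_cross)   # 400 10000
```

In the first case cross-attention is the cheaper operation; in the second it dominates, which is why the comparison depends on the relative lengths of the two sequences.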
