
Cross-Attention

Encoder-Decoder Attention
An attention mechanism where queries come from one sequence and keys/values come from a different sequence. In encoder-decoder models, the decoder's queries attend to the encoder's keys and values, letting the decoder "look at" the input while generating the output. Cross-attention is also how text conditions image generation in diffusion models — the image generation process attends to the text prompt.

Why It Matters

Cross-attention is the bridge between different modalities and between different parts of an architecture. It is how translation models connect source and target languages, how image generators follow text prompts, how multimodal models relate images to text, and how retrieval-augmented systems incorporate retrieved documents. Whenever two different inputs need to interact, cross-attention is usually involved.

Deep Dive

In self-attention, Q, K, and V all come from the same sequence — each token attends to other tokens in the same input. In cross-attention, Q comes from one source (e.g., the decoder) and K, V come from another (e.g., the encoder). The decoder token asks "what in the input is relevant to what I'm generating right now?" and the attention mechanism provides a weighted summary of the input.
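The Q-from-decoder, K/V-from-encoder split above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product cross-attention; the projection matrices are random stand-ins for learned weights, and the sequence lengths are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    Q = decoder_states @ Wq            # queries come from the decoder
    K = encoder_states @ Wk            # keys come from the encoder
    V = encoder_states @ Wv            # values come from the encoder
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (dec_len, enc_len) relevance scores
    weights = softmax(scores, axis=-1) # each decoder token's view of the input
    return weights @ V                 # weighted summary of the encoder output

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
enc = rng.normal(size=(5, d_model))    # 5 encoder (input) tokens
dec = rng.normal(size=(3, d_model))    # 3 decoder (output-so-far) tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = cross_attention(dec, enc, Wq, Wk, Wv)
print(out.shape)  # (3, 8): one input summary per decoder token
```

Note that the output has one row per decoder token: each query receives its own weighted summary of the encoder sequence.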

In Diffusion Models

Text-to-image models use cross-attention to condition image generation on text. The text prompt is encoded into embeddings (via a text encoder such as CLIP or T5), and at each denoising step, the image features attend to the text embeddings through cross-attention layers. This is how the model knows to generate a "cat on a surfboard" — each spatial location in the image attends to the relevant words. Manipulating these cross-attention maps is how techniques like prompt weighting and attention editing work.
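The spatial-locations-attend-to-words idea can be sketched as follows. This is a simplified illustration, not a real diffusion model: the learned Q/K/V projections are omitted, and the feature grid and token count are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, d = 8, 8, 16                 # an 8x8 grid of image features
L = 4                              # 4 text tokens (e.g. "cat on a surfboard")
img = rng.normal(size=(H * W, d))  # flattened spatial features act as queries
txt = rng.normal(size=(L, d))      # text embeddings act as keys and values

scores = img @ txt.T / np.sqrt(d)  # (64, 4): location-to-word relevance
maps = softmax(scores, axis=-1)    # cross-attention maps, one weight per word
out = maps @ txt                   # text-conditioned image features
# maps[:, i].reshape(H, W) is the spatial attention map for word i —
# the object that prompt weighting and attention editing manipulate.
print(maps.shape, out.shape)
```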

Attention Patterns

Self-attention and cross-attention have different computational profiles. Self-attention is quadratic in the sequence length (every token attends to every other token). Cross-attention scales with the product of the decoder and encoder lengths (each decoder token attends to all encoder tokens), so it is linear in each. In practice, the encoder output is often much shorter than the decoder sequence, making cross-attention cheaper than decoder self-attention.
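A back-of-the-envelope comparison makes the cost difference concrete. The lengths below are illustrative assumptions (a long decoder sequence against a short, fixed-length prompt encoding).

```python
# Number of entries in the attention score matrix, per layer per head.
dec_len = 1000   # decoder sequence length (assumed for illustration)
enc_len = 77     # encoder output length, e.g. a fixed-size prompt encoding

self_attn_scores = dec_len * dec_len   # quadratic in decoder length
cross_attn_scores = dec_len * enc_len  # product of the two lengths

print(self_attn_scores, cross_attn_scores)  # 1000000 77000
```

With these lengths, decoder self-attention computes roughly 13x more scores than cross-attention, which is why cross-attention is rarely the bottleneck.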

Related Concepts
