Basics

Cross-Attention

Encoder-Decoder Attention
An attention mechanism in which queries come from one sequence and keys/values come from a different one. In an encoder-decoder model, the decoder's queries attend to the encoder's keys and values, letting the decoder "look at" the input while generating output. Cross-attention is also how text conditions image generation in diffusion models: the image generation process attends to the text prompt.

Why It Matters

Cross-attention is the bridge between different modalities and between different parts of an architecture. It is how translation models connect the source and target languages, how image generators follow a text prompt, how multimodal models tie images and text together, and how retrieval-augmented systems integrate retrieved documents. Whenever two different inputs need to interact, cross-attention is usually involved.

Deep Dive

In self-attention, Q, K, and V all come from the same sequence — each token attends to other tokens in the same input. In cross-attention, Q comes from one source (e.g., the decoder) and K, V come from another (e.g., the encoder). The decoder token asks "what in the input is relevant to what I'm generating right now?" and the attention mechanism provides a weighted summary of the input.
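A minimal single-head sketch in NumPy makes the asymmetry concrete: the projection matrices, shapes, and the cross_attention name here are illustrative stand-ins, not any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Single-head cross-attention: queries from the decoder, keys/values from the encoder."""
    Q = decoder_states @ Wq                     # (n_dec, d_k)
    K = encoder_states @ Wk                     # (n_enc, d_k)
    V = encoder_states @ Wv                     # (n_enc, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (n_dec, n_enc)
    weights = softmax(scores, axis=-1)          # each decoder token's distribution over encoder tokens
    return weights @ V                          # (n_dec, d_v): weighted summary of the input

# Toy shapes: 5 decoder tokens attend to 12 encoder tokens, model dim 64.
rng = np.random.default_rng(0)
d_model, n_dec, n_enc = 64, 5, 12
dec = rng.standard_normal((n_dec, d_model))
enc = rng.standard_normal((n_enc, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)  # (5, 64)
```

Replacing encoder_states with decoder_states in the K/V projections recovers ordinary self-attention; the only structural difference is where the keys and values come from.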

In Diffusion Models

Text-to-image models use cross-attention to condition image generation on text. The text prompt is encoded into embeddings (via CLIP or T5), and at each denoising step, the image features attend to the text embeddings through cross-attention layers. This is how the model knows to generate a "cat on a surfboard" — each spatial location in the image attends to the relevant words. Manipulating these cross-attention maps is how techniques like prompt weighting and attention editing work.
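A simplified sketch of one such conditioning layer, assuming a latent-diffusion-style setup: a 16x16 latent feature map with 320 channels attends to 77 text-token embeddings (the usual CLIP context length). The random projection matrices stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
h, w, d_img, n_text, d_text, d_k = 16, 16, 320, 77, 768, 64

image_feats = rng.standard_normal((h * w, d_img))     # each row = one spatial location in the latent
text_embeds = rng.standard_normal((n_text, d_text))   # from a frozen text encoder (e.g. CLIP)

Wq = rng.standard_normal((d_img, d_k)) * 0.05         # queries come from the image features
Wk = rng.standard_normal((d_text, d_k)) * 0.05        # keys/values come from the prompt embeddings
Wv = rng.standard_normal((d_text, d_img)) * 0.05

scores = (image_feats @ Wq) @ (text_embeds @ Wk).T / np.sqrt(d_k)  # (256, 77)
attn = softmax(scores, axis=-1)                        # per-location distribution over prompt tokens
conditioned = attn @ (text_embeds @ Wv)                # text-informed update to the image features

print(attn.shape, conditioned.shape)                   # (256, 77) (256, 320)
```

The attn matrix is the cross-attention map referred to above: prompt weighting scales its columns, and attention-editing methods overwrite or blend these maps across denoising steps.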

Attention Patterns

Self-attention and cross-attention have different computational profiles. Self-attention is quadratic in the sequence length (every token attends to every other token). Cross-attention scales with the product of the decoder and encoder lengths (each decoder token attends to all encoder tokens). In practice, the encoder output is often much shorter than the decoder sequence, which makes cross-attention cheaper than decoder self-attention.
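A quick back-of-the-envelope comparison, assuming an illustrative 1000-token decoder and a 100-token encoder output:

```python
# Size of the attention-score matrices for one layer (rows x columns).
n_dec, n_enc = 1000, 100

self_attn_scores  = n_dec * n_dec   # decoder self-attention: 1,000,000 entries
cross_attn_scores = n_dec * n_enc   # cross-attention:          100,000 entries

print(self_attn_scores, cross_attn_scores)
# Multiply-accumulates scale the same way (~n_dec*n_dec*d vs ~n_dec*n_enc*d),
# so with this 10x-shorter encoder output, cross-attention is roughly 10x cheaper.
```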
