Fundamentals

Cross-Attention

Encoder-Decoder Attention
An attention mechanism in which the queries come from one sequence and the keys/values come from a different sequence. In an encoder-decoder model, the decoder's queries attend to the encoder's keys and values, letting the decoder "look at" the input while generating output. Cross-attention is also how text conditions image generation in diffusion models: the image generation process attends to the text prompt.
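
In equation form, this is the standard scaled dot-product attention, with the two input sequences kept distinct. Here $X_{\mathrm{dec}}$ and $X_{\mathrm{enc}}$ denote the decoder and encoder hidden states, and $W_Q$, $W_K$, $W_V$ are learned projection matrices:

$$
Q = X_{\mathrm{dec}} W_Q, \qquad K = X_{\mathrm{enc}} W_K, \qquad V = X_{\mathrm{enc}} W_V
$$

$$
\mathrm{CrossAttention}(X_{\mathrm{dec}}, X_{\mathrm{enc}}) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$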

Why It Matters

Cross-attention is the bridge between different modalities and between different parts of an architecture. It is how translation models connect source and target languages, how image generators follow text prompts, how multimodal models link images and text, and how retrieval-augmented systems integrate retrieved documents. Whenever two distinct inputs need to interact, cross-attention is usually involved.

Deep Dive

In self-attention, Q, K, and V all come from the same sequence — each token attends to other tokens in the same input. In cross-attention, Q comes from one source (e.g., the decoder) and K, V come from another (e.g., the encoder). The decoder token asks "what in the input is relevant to what I'm generating right now?" and the attention mechanism provides a weighted summary of the input.
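
As a concrete sketch, here is a minimal single-head cross-attention layer in PyTorch. The class name, d_model, and shapes are illustrative choices for this wiki, not from any particular library; production models use multi-head attention.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # projects decoder states to queries
        self.w_k = nn.Linear(d_model, d_model)  # projects encoder states to keys
        self.w_v = nn.Linear(d_model, d_model)  # projects encoder states to values
        self.scale = math.sqrt(d_model)

    def forward(self, decoder_states, encoder_states):
        # decoder_states: (batch, dec_len, d_model) -- the side asking the question
        # encoder_states: (batch, enc_len, d_model) -- the side being looked at
        q = self.w_q(decoder_states)
        k = self.w_k(encoder_states)
        v = self.w_v(encoder_states)
        # (batch, dec_len, enc_len): each decoder position scores every encoder position
        scores = q @ k.transpose(-2, -1) / self.scale
        weights = scores.softmax(dim=-1)
        # Weighted summary of the encoder sequence for each decoder position
        return weights @ v

# Usage: 4 decoder tokens attending over 10 encoder tokens
attn = CrossAttention(d_model=64)
out = attn(torch.randn(2, 4, 64), torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 4, 64])
```

The only difference from self-attention is that `forward` takes two sequences instead of one; setting `encoder_states = decoder_states` recovers ordinary self-attention.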

In Diffusion Models

Text-to-image models use cross-attention to condition image generation on text. The text prompt is encoded into embeddings (via CLIP or T5), and at each denoising step, the image features attend to the text embeddings through cross-attention layers. This is how the model knows to generate a "cat on a surfboard" — each spatial location in the image attends to the relevant words. Manipulating these cross-attention maps is how techniques like prompt weighting and attention editing work.
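
A rough sketch of how this wiring might look inside one denoising step, reusing the CrossAttention module from the Deep Dive above. All dimensions here (a 16x16 spatial grid, 64 channels, 77 text tokens matching CLIP's usual limit) are illustrative assumptions:

```python
import torch

batch, channels, height, width = 2, 64, 16, 16
text_len, d_model = 77, 64  # text embedding dim assumed equal to channels here

image_features = torch.randn(batch, channels, height, width)  # denoiser activations
text_embeddings = torch.randn(batch, text_len, d_model)       # encoded prompt

# Flatten the spatial grid: each of the 16x16 locations becomes one query token
tokens = image_features.flatten(2).transpose(1, 2)  # (batch, 256, d_model)

# Queries come from the image, keys/values from the text prompt
attn = CrossAttention(d_model)
conditioned = attn(tokens, text_embeddings)  # (batch, 256, d_model)

# Reshape back to a feature map and continue denoising
conditioned = conditioned.transpose(1, 2).reshape(batch, channels, height, width)
```

The `weights` tensor inside the attention layer is exactly the cross-attention map that prompt-weighting and attention-editing techniques manipulate.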

Attention Patterns

Self-attention and cross-attention have different computational profiles. Self-attention is quadratic in the sequence length (every token attends to every other token). Cross-attention scales with the product of the decoder and encoder lengths (each decoder token attends to all encoder tokens). In practice, the encoder output is often much shorter than the decoder sequence (for example, a 77-token text prompt versus thousands of image-latent tokens), making cross-attention cheaper than decoder self-attention.
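
A back-of-the-envelope comparison of score-matrix sizes for one layer, ignoring the feature dimension. The lengths are illustrative: a 77-token prompt versus a 4096-token latent sequence (roughly a 64x64 latent grid):

```python
dec_len, enc_len = 4096, 77

self_attn_scores = dec_len * dec_len   # every token scores every other token
cross_attn_scores = dec_len * enc_len  # each decoder token scores encoder tokens

print(f"self-attention:  {self_attn_scores:,} scores")        # 16,777,216
print(f"cross-attention: {cross_attn_scores:,} scores")       # 315,392
print(f"ratio: {self_attn_scores / cross_attn_scores:.0f}x")  # 53x
```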
