
Encoder-Decoder

Seq2Seq, Sequence-to-Sequence
A model architecture with two separate parts: the encoder reads the input and compresses it into a representation, and the decoder generates the output from that representation. The original Transformer paper describes an encoder-decoder. T5 and BART are encoder-decoder models. By contrast, GPT, Claude, and Llama are decoder-only (no encoder), and BERT is encoder-only (no decoder).
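As a concrete illustration, here is a minimal sketch of running an encoder-decoder model for a sequence-to-sequence task, assuming the Hugging Face transformers library and the public t5-small checkpoint (the "translate English to German:" prefix follows T5's published prompt format):

```python
# Sketch: encoder-decoder (seq2seq) inference with T5 via Hugging Face transformers.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads the whole input; the decoder then generates the output
# token by token, attending to the encoder's representations.
inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```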

Why It Matters

Understanding encoder-decoder vs. decoder-only explains why different models excel at different tasks. Encoder-decoder models are a natural fit for tasks that transform one sequence into another (translation, summarization). Decoder-only models are better at open-ended generation. The field has converged on decoder-only for LLMs, but encoder-decoder is far from dead.

Deep Dive

In an encoder-decoder Transformer, the encoder processes the full input using bidirectional self-attention — every token can see every other token. This creates a rich representation of the input. The decoder then generates output tokens autoregressively, attending to both the previously generated tokens (via masked self-attention) and the encoder's representations (via cross-attention). This cross-attention is the bridge between understanding and generation.
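To make the three attention patterns concrete, here is a minimal PyTorch sketch with illustrative shapes only (random tensors stand in for real embeddings; a full layer would also include feed-forward blocks, residuals, and layer norm):

```python
# Sketch: the three attention patterns in an encoder-decoder Transformer layer.
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
enc_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
dec_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

src = torch.randn(1, 10, d_model)  # embedded input tokens (batch, src_len, d_model)
tgt = torch.randn(1, 7, d_model)   # embedded output tokens generated so far

# Encoder: bidirectional self-attention -- every input token sees every other.
enc_out, _ = enc_self_attn(src, src, src)

# Decoder self-attention: a causal mask (True = blocked) hides future positions.
causal_mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)
dec_hidden, _ = dec_self_attn(tgt, tgt, tgt, attn_mask=causal_mask)

# Cross-attention: decoder queries attend to the encoder's representations,
# the bridge between understanding the input and generating the output.
dec_out, _ = cross_attn(dec_hidden, enc_out, enc_out)
print(dec_out.shape)  # torch.Size([1, 7, 64])
```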

Decoder-Only Won

Modern LLMs (GPT, Claude, Llama, Gemini) are all decoder-only: there's no separate encoder, and the model uses causal (left-to-right) attention throughout. Why did decoder-only win? Simplicity and scaling. Encoder-decoder requires two separate attention mechanisms and the architecture introduces questions about how to split capacity between encoder and decoder. Decoder-only is uniform and scales cleanly. It also handles both understanding and generation in one architecture by treating every task as text generation.
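A minimal sketch of the decoder-only pattern, again assuming the Hugging Face transformers library; GPT-2 stands in here as a small public decoder-only checkpoint, and the prompt is only meant to illustrate phrasing a task as text to be continued:

```python
# Sketch: one causal model handles "understanding" and "generation" --
# the task is expressed as a prompt and the model simply continues the text.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Translate English to French:\nsea otter -> loutre de mer\ncheese ->"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```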

Encoder-Only: BERT's Legacy

Encoder-only models like BERT use bidirectional attention (every token sees all other tokens) and are trained with masked language modeling. They can't generate text, but they produce excellent representations for classification, NER, semantic similarity, and search. Most embedding models used in RAG pipelines are encoder-only. They're smaller, faster, and cheaper than LLMs for tasks that don't require generation.
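A minimal sketch of using an encoder-only model as an embedding model, as is common in RAG pipelines. Mean pooling over token states is one simple pooling choice used here for illustration; dedicated embedding models typically train their own pooling or projection head:

```python
# Sketch: sentence embeddings from an encoder-only model (BERT) via mean pooling.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["What is cross-attention?",
             "How does a decoder attend to the encoder?"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim.item())
```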
