Encoder-Decoder

Seq2Seq, Sequence-to-Sequence
A model architecture with two distinct parts: an encoder that reads and compresses the input into a representation, and a decoder that generates the output from that representation. The original Transformer paper described an encoder-decoder. T5 and BART are encoder-decoder models. In contrast, GPT/Claude/Llama are decoder-only (no encoder), and BERT is encoder-only (no decoder).

Why it matters

Understanding encoder-decoder vs. decoder-only explains why different models excel at different tasks. Encoder-decoder models are naturally suited to tasks that transform one sequence into another (translation, summarization). Decoder-only models are better at open-ended generation. The field has largely converged on decoder-only for LLMs, but encoder-decoder is far from dead.

Deep Dive

In an encoder-decoder Transformer, the encoder processes the full input using bidirectional self-attention — every token can see every other token. This creates a rich representation of the input. The decoder then generates output tokens autoregressively, attending to both the previously generated tokens (via masked self-attention) and the encoder's representations (via cross-attention). This cross-attention is the bridge between understanding and generation.
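The attention flow above can be sketched in a few lines of NumPy. This is a toy illustration, not a real implementation: it omits the learned Q/K/V projections, multiple heads, and layer stacking, and uses random vectors in place of real hidden states.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; scores has shape (len_q, len_k).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
src = rng.standard_normal((5, d))   # stand-in for encoder output: 5 input tokens
tgt = rng.standard_normal((3, d))   # stand-in for decoder states: 3 generated tokens

# 1) Masked self-attention: each decoder position sees itself and earlier positions only.
causal = np.tril(np.ones((3, 3), dtype=bool))
self_out = attention(tgt, tgt, tgt, mask=causal)

# 2) Cross-attention: decoder queries attend over ALL encoder positions (no mask).
#    This is the bridge between the encoder's representation and generation.
cross_out = attention(self_out, src, src)
```

Note that the first decoder position can attend only to itself under the causal mask, while every decoder position attends to all five encoder positions in the cross-attention step.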

Decoder-Only Won

Modern LLMs (GPT, Claude, Llama, Gemini) are all decoder-only: there's no separate encoder, and the model uses causal (left-to-right) attention throughout. Why did decoder-only win? Simplicity and scaling. An encoder-decoder needs three kinds of attention (encoder self-attention, masked decoder self-attention, and cross-attention), and the architecture raises the question of how to split capacity between encoder and decoder. Decoder-only is uniform and scales cleanly. It also handles both understanding and generation in one architecture by treating every task as text generation.
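The uniformity is easy to see in the mask itself. In a decoder-only model there is one token stream and one causal mask covering it: the prompt and the generated continuation are not architecturally distinct. A minimal sketch (the 4/2 prompt/generation split is an arbitrary example):

```python
import numpy as np

# One stream: say 4 prompt tokens followed by 2 generated tokens.
# The same lower-triangular mask governs all of them.
seq_len = 6
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# A generated token at position 4 can read the whole prompt (0..3) and itself...
assert mask[4, :5].all()
# ...but never a future position. "Understanding" the prompt and generating
# the answer use the exact same mechanism.
assert not mask[4, 5]
```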

Encoder-Only: BERT's Legacy

Encoder-only models like BERT use bidirectional attention (every token sees all other tokens) and are trained with masked language modeling. They can't generate text, but they produce excellent representations for classification, NER, semantic similarity, and search. Most embedding models used in RAG pipelines are encoder-only. They're smaller, faster, and cheaper than LLMs for tasks that don't require generation.
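A typical embedding pipeline built on an encoder-only model boils down to pooling the per-token outputs into one vector and comparing vectors by cosine similarity. A minimal sketch with random stand-ins for the encoder outputs (real pipelines use a trained model and often length-masked mean pooling):

```python
import numpy as np

def mean_pool(token_vecs):
    # Collapse per-token encoder outputs into a single sentence embedding.
    return token_vecs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
# Stand-ins for bidirectional encoder outputs of two sentences (toy dim 16).
sent_a = rng.standard_normal((7, 16))   # 7 tokens
sent_b = rng.standard_normal((5, 16))   # 5 tokens

emb_a, emb_b = mean_pool(sent_a), mean_pool(sent_b)
score = cosine(emb_a, emb_b)  # the similarity score a RAG retriever would rank by
```

This is why encoder-only models remain the workhorse for search and retrieval: one forward pass per document, then cheap vector comparisons at query time.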
