
Encoder-Decoder

Seq2Seq, Sequence-to-Sequence
A model architecture with two distinct parts: an encoder that reads the input and compresses it into a representation, and a decoder that generates the output from that representation. The original Transformer paper described an encoder-decoder. T5 and BART are encoder-decoder models. Conversely, GPT, Claude, and Llama are decoder-only (no encoder), and BERT is encoder-only (no decoder).
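
For instance, a minimal sketch of running an encoder-decoder model (T5) through the Hugging Face transformers library. This assumes transformers, torch, and sentencepiece are installed; "t5-small" is a public checkpoint, and the exact output text is illustrative, not guaranteed:

```python
# Sketch: an encoder-decoder model in practice. T5 frames every task as
# text-to-text, so translation is triggered by a task prefix in the input.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # needs sentencepiece
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads the full prompt; the decoder generates the output.
inputs = tokenizer("translate English to French: The house is blue.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```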

Why it matters

Understanding encoder-decoder vs. decoder-only explains why different models excel at different tasks. Encoder-decoder models are naturally good at tasks where you transform one sequence into another (translation, summarization). Decoder-only models are better at open-ended generation. The whole field has converged on decoder-only for LLMs, but encoder-decoder is far from dead.

Deep Dive

In an encoder-decoder Transformer, the encoder processes the full input using bidirectional self-attention — every token can see every other token. This creates a rich representation of the input. The decoder then generates output tokens autoregressively, attending to both the previously generated tokens (via masked self-attention) and the encoder's representations (via cross-attention). This cross-attention is the bridge between understanding and generation.
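
A minimal PyTorch sketch of the two decoder-side attention steps just described. The dimensions, tensors, and module wiring are made up for illustration; a real decoder layer also includes residual connections, normalization, and a feed-forward block:

```python
# Sketch of decoder attention in an encoder-decoder Transformer:
# masked self-attention over generated tokens, then cross-attention
# from decoder queries to encoder keys/values.
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 10, d_model)  # encoder states for 10 input tokens
dec_in = torch.randn(1, 4, d_model)    # 4 decoder tokens generated so far

# Masked self-attention: True entries are blocked, so each decoder
# position can only attend to itself and earlier positions.
causal_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
h, _ = self_attn(dec_in, dec_in, dec_in, attn_mask=causal_mask)

# Cross-attention: the bridge between understanding and generation.
# Queries come from the decoder; keys and values from the encoder.
h, _ = cross_attn(h, enc_out, enc_out)
```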

Decoder-Only Won

Modern LLMs (GPT, Claude, Llama, Gemini) are all decoder-only: there's no separate encoder, and the model uses causal (left-to-right) attention throughout. Why did decoder-only win? Simplicity and scaling. Encoder-decoder needs three attention patterns (encoder self-attention, masked decoder self-attention, and cross-attention) and forces a design decision about how to split parameter capacity between the two halves. Decoder-only is uniform and scales cleanly. It also handles both understanding and generation in one architecture by treating every task as text generation.
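
As a contrast to the T5 sketch above, here is the same translation task framed as pure text continuation with the public "gpt2" checkpoint. The output quality will be poor (GPT-2 is small and not instruction-tuned); the point is structural: prompt and answer live in one causal token stream, with no separate encoder:

```python
# Sketch: decoder-only models treat every task as text continuation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The "input" is just the beginning of the sequence the model continues.
prompt = "Translate English to French: The house is blue. French:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```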

Encoder-Only: BERT's Legacy

Encoder-only models like BERT use bidirectional attention (every token sees all other tokens) and are trained with masked language modeling. They can't generate text, but they produce excellent representations for classification, NER, semantic similarity, and search. Most embedding models used in RAG pipelines are encoder-only. They're smaller, faster, and cheaper than LLMs for tasks that don't require generation.
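
A sketch of pulling sentence embeddings from an encoder-only model, as a RAG pipeline might for retrieval. Mean pooling over token states is one common choice among several (CLS-token pooling is another), and "bert-base-uncased" is a public checkpoint used here for illustration:

```python
# Sketch: sentence embeddings from an encoder-only (BERT-style) model.
# Bidirectional attention yields one contextual vector per token; we
# mean-pool over non-padding tokens to get one vector per sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["How do transformers work?", "Explain attention mechanisms."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq, hidden)

# Zero out padding positions, then average the remaining token vectors.
mask = inputs["attention_mask"].unsqueeze(-1)    # (batch, seq, 1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, 768])
```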
