BERT: Definition & Meaning — AI Wiki

Un modèle basé sur Transformer de Google (2018) qui a révolutionné le NLP en introduisant le pré-entraînement bidirectionnel — chaque token peut s'appliquer à chaque autre token, donnant au modèle une compréhension contextuelle profonde. BERT est un modèle encoder-only : il excelle à comprendre le texte (classification, recherche, NER) mais ne peut pas générer du texte comme GPT ou Claude.

Pourquoi c'est important

BERT est le papier NLP le plus influent de l'ère moderne. Il a prouvé que le pré-entraînement sur du texte non étiqueté puis le fine-tuning sur des tâches spécifiques pouvait écraser chaque benchmark existant. Même si les LLM ont volé la vedette, les modèles de style BERT alimentent encore la plupart des moteurs de recherche en production, des systèmes d'embedding et des pipelines de classification parce qu'ils sont plus petits, plus rapides et moins chers que les LLM pour les tâches non génératives.

Deep Dive

BERT's training uses two objectives: Masked Language Modeling (MLM) — randomly mask 15% of tokens and predict them from context — and Next Sentence Prediction (NSP) — predict whether two sentences are consecutive. MLM forces bidirectional understanding because the model must use both left and right context to predict masked words. This is fundamentally different from GPT's left-to-right approach.

Why BERT Still Matters

In the LLM era, BERT-family models (RoBERTa, DeBERTa, DistilBERT) remain the backbone of production NLP. They're 100x smaller than LLMs (110M–340M parameters vs. billions), 10x faster for inference, and often better for tasks that don't require generation. Most embedding models used in RAG and semantic search are BERT descendants. Google Search used BERT extensively before transitioning to larger models.

BERT vs. GPT: The Architecture Split

BERT (encoder-only, bidirectional) and GPT (decoder-only, left-to-right) represent two philosophies. BERT sees the whole input at once — perfect for understanding. GPT sees only what came before — perfect for generating. The field initially thought encoder-decoder (T5) would win by combining both. Instead, decoder-only (GPT approach) won for LLMs because it scales more cleanly, and you can approximate bidirectional understanding through clever prompting.

BERT

Pourquoi c'est important

Deep Dive

Why BERT Still Matters

BERT vs. GPT: The Architecture Split

Concepts liés