
BERT

Bidirectional Encoder Representations from Transformers
A Transformer-based model from Google (2018) that revolutionized NLP by introducing bidirectional pretraining: every token can attend to every other token, giving the model a deep contextual understanding of the input. BERT is an encoder-only model: it excels at understanding text (classification, search, NER) but cannot generate text the way GPT or Claude can.

Why It Matters

BERT is one of the most influential NLP papers of the modern era. It showed that pretraining on unlabeled text and then fine-tuning on a specific task could sweep every existing benchmark. Even though LLMs have since taken the spotlight, BERT-style models still power most production search engines, embedding systems, and classification pipelines, because for non-generative tasks they are smaller, faster, and cheaper than LLMs.

Deep Dive

BERT's training uses two objectives: Masked Language Modeling (MLM) — randomly mask 15% of tokens and predict them from context — and Next Sentence Prediction (NSP) — predict whether two sentences are consecutive. MLM forces bidirectional understanding because the model must use both left and right context to predict masked words. This is fundamentally different from GPT's left-to-right approach.
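A minimal sketch of the MLM masking step, using only the standard library. The 80/10/10 replacement split comes from the original BERT paper; the function name and toy vocabulary below are purely illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    # BERT-style MLM masking: pick ~15% of positions as prediction targets,
    # then replace 80% of them with [MASK], 10% with a random token, and
    # leave 10% unchanged (the 80/10/10 rule from the BERT paper).
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return inputs, labels

corrupted, targets = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "hat"])
print(corrupted)  # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(targets)    # e.g. [None, 'cat', None, None, None, None]
```

During pretraining the model is penalized only at the positions where `labels` is set, which is what forces it to reconstruct a word from both its left and right neighbors.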

Why BERT Still Matters

In the LLM era, BERT-family models (RoBERTa, DeBERTa, DistilBERT) remain the backbone of production NLP. They are one to two orders of magnitude smaller than LLMs (110M–340M parameters vs. billions), roughly 10x faster at inference, and often better suited to tasks that don't require generation. Most embedding models used in RAG and semantic search are BERT descendants. Google Search used BERT extensively before transitioning to larger models.
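As a hedged sketch of why BERT descendants dominate embedding work: the snippet below turns sentences into fixed-size vectors for semantic search using the Hugging Face transformers library with mean pooling over token vectors. The checkpoint name and pooling choice are illustrative assumptions; purpose-built embedding models usually perform better in practice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-family encoder works here; bert-base-uncased is just an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    # Encode sentences into fixed-size vectors by mean-pooling token embeddings.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state    # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = embed(["BERT is an encoder-only model.", "Paris is the capital of France."])
query = embed(["Which model uses only an encoder?"])
print(torch.nn.functional.cosine_similarity(query, docs))  # similarity of the query to each doc
```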

BERT vs. GPT: The Architecture Split

BERT (encoder-only, bidirectional) and GPT (decoder-only, left-to-right) represent two philosophies. BERT sees the whole input at once, which is ideal for understanding; GPT sees only what came before, which is ideal for generating. The field initially expected encoder-decoder models (T5) to win by combining both. Instead, the decoder-only GPT approach won for LLMs because it scales more cleanly, and much of the benefit of bidirectional understanding can be recovered through clever prompting.
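A small sketch of the difference in attention patterns, assuming PyTorch; the mask construction below is a standard textbook illustration rather than code from either model.

```python
import torch

seq_len = 5

# Encoder-style (BERT): every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-style (GPT): position i may attend only to positions j <= i,
# so future tokens stay hidden during both training and generation.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())
```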
