
BERT

Bidirectional Encoder Representations from Transformers
Google's 2018 Transformer-based model that revolutionized NLP by introducing bidirectional pretraining: every token can attend to every other token, giving the model deep contextual understanding. BERT is an encoder-only model; it excels at understanding text (classification, search, NER) but cannot generate text the way GPT or Claude can.
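
The sketch below illustrates the encoder-only idea: BERT turns text into contextual vectors rather than generating new text. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, both illustrative choices rather than anything prescribed by this article.

```python
# Minimal sketch: using BERT as a text encoder, not a generator.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes text, it does not generate it.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; every token attended to every other token.
token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, 768)

# The [CLS] vector is commonly used as a sentence-level representation
# that a downstream classifier (e.g. for NER or search ranking) sits on top of.
cls_embedding = token_embeddings[:, 0, :]      # shape: (1, 768)
print(cls_embedding.shape)
```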

Why It Matters

BERT is one of the most influential NLP papers of the modern era. It showed that pretraining on unlabeled text and then fine-tuning on a specific task could beat virtually every existing benchmark. Even though LLMs have taken the spotlight, BERT-style models still power most production search engines, embedding systems, and classification pipelines, because for non-generative tasks they are smaller, faster, and cheaper than LLMs.

Deep Dive

BERT's training uses two objectives: Masked Language Modeling (MLM) — randomly mask 15% of tokens and predict them from context — and Next Sentence Prediction (NSP) — predict whether two sentences are consecutive. MLM forces bidirectional understanding because the model must use both left and right context to predict masked words. This is fundamentally different from GPT's left-to-right approach.
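
A toy sketch of the MLM corruption step described above: pick roughly 15% of tokens, hide them, and keep the originals as the prediction targets. This is simplified relative to the paper, which further splits the selected 15% into [MASK] / random token / unchanged at roughly 80/10/10; the function and variable names here are illustrative.

```python
# Toy sketch of Masked Language Modeling input corruption.
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)     # the model must predict this token
        else:
            corrupted.append(tok)
            labels.append(None)    # no loss is computed on unmasked positions
    return corrupted, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens))
```

Because the masked position can sit anywhere in the sentence, the model has to use context on both its left and its right to recover it, which is exactly what forces bidirectional representations.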

Why BERT Still Matters

In the LLM era, BERT-family models (RoBERTa, DeBERTa, DistilBERT) remain the backbone of production NLP. They're 100x smaller than LLMs (110M–340M parameters vs. billions), 10x faster for inference, and often better for tasks that don't require generation. Most embedding models used in RAG and semantic search are BERT descendants. Google Search used BERT extensively before transitioning to larger models.
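
As a concrete example of the embedding use case, here is a minimal semantic-search sketch with a BERT-family sentence encoder. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (a small distilled BERT-family model); both are illustrative choices, not the only option.

```python
# Minimal semantic-search sketch: embed documents and a query with a
# BERT-family encoder, then rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "BERT is an encoder-only Transformer.",
    "GPT models generate text one token at a time.",
    "Croissants are a French pastry.",
]
query = "Which model only encodes text?"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_emb)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```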

BERT vs. GPT: The Architecture Split

BERT (encoder-only, bidirectional) and GPT (decoder-only, left-to-right) represent two philosophies. BERT sees the whole input at once — perfect for understanding. GPT sees only what came before — perfect for generating. The field initially thought encoder-decoder (T5) would win by combining both. Instead, decoder-only (GPT approach) won for LLMs because it scales more cleanly, and you can approximate bidirectional understanding through clever prompting.
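
The architectural split boils down to the attention mask. A short sketch of the two masks, using PyTorch only for convenience:

```python
# Encoder (BERT-style): every position may attend to every other position.
# Decoder (GPT-style): a causal mask hides all future positions.
import torch

seq_len = 5

bidirectional_mask = torch.ones(seq_len, seq_len)        # full visibility
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # position i sees only <= i

print(bidirectional_mask)
print(causal_mask)
```

Everything else in the Transformer block is largely shared; the mask is what makes BERT an "understander" and GPT a "generator".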
