
Masked Language Modeling

MLM, Masked LM, Cloze Task
A self-supervised training objective in which randomly chosen tokens in the input are replaced with a [MASK] token, and the model must predict the original tokens from the surrounding context. BERT popularized MLM: mask 15% of tokens, attend bidirectionally over left and right context, and predict the masked words. This yields powerful text-understanding models (as opposed to text-generation models).

Why It Matters

MLM is the training objective that created BERT and the whole family of encoder models, which still power most search, classification, and embedding systems in production today. Understanding MLM vs. causal language modeling (next-token prediction) explains the fundamental split between understanding models (BERT) and generative models (GPT), and why each excels at different tasks.

Deep Dive

The procedure: take a text sequence and randomly select 15% of positions. Of those positions, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are kept unchanged. The model must predict the original token at each selected position. The 80/10/10 split prevents the model from learning to attend only to [MASK] tokens, which never appear at inference time.
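A minimal sketch of this masking step in PyTorch, mirroring the logic of Hugging Face's DataCollatorForLanguageModeling (the function name `mask_tokens` and its parameters are hypothetical, and special-token handling is omitted for brevity):

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT-style MLM masking (sketch; skips special-token exclusion)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # -100 is ignored by cross-entropy: loss only on selected positions

    # 80% of selected positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[to_mask] = mask_token_id

    # Half of the remaining 20% (10% overall) -> random token
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # The last 10% of selected positions stay unchanged but are still predicted.
    return input_ids, labels
```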

Bidirectional Context

The key advantage of MLM over causal LM: the model sees both left and right context when making predictions. For the sentence "The [MASK] sat on the mat," the model uses both "The" (left context) and "sat on the mat" (right context) to predict "cat." This bidirectional understanding is why BERT-style models produce richer representations than left-to-right models for understanding tasks.
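You can see this in action with the Hugging Face transformers fill-mask pipeline, assuming the library and the bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

# The model uses context on both sides of [MASK] to rank candidates.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The [MASK] sat on the mat."):
    print(f"{pred['token_str']:>8}  {pred['score']:.3f}")  # "cat" should rank highly
```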

MLM vs. Causal LM

The trade-off: MLM creates excellent understanding (good for classification, search, NER) but can't generate text fluently (predicting masked tokens isn't the same as generating a sequence). Causal LM (predict the next token left-to-right) generates fluently but understands less deeply (it only sees left context). This split drove the encoder-vs-decoder divergence in NLP. Nearly all modern LLMs are causal (decoder-only) because generation is more commercially valuable, but MLM-trained models remain the backbone of search and classification.
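A toy illustration of the structural difference between the two objectives: an encoder trained with MLM uses a full attention mask, while a causal decoder uses a lower-triangular one (PyTorch shown; variable names are illustrative):

```python
import torch

seq_len = 5

# Encoder / MLM: full bidirectional attention, every position sees every other.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder / causal LM: lower-triangular mask, position i sees only positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```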
