Training

Masked Language Modeling

MLM, Masked LM, Cloze Task
一種自監督訓練目標,輸入中隨機 token 被替換成 [MASK] token,模型必須從上下文預測原始 token。BERT 讓 MLM 流行:mask 15% 的 token,用雙向 attention 看左右上下文,預測被 mask 的詞。這創造了強大的文字理解模型(與文字生成模型相對)。

Why It Matters

MLM is the training objective that produced BERT and the entire encoder model family, which still powers most production search, classification, and embedding systems today. Understanding MLM vs. causal language modeling (next-token prediction) explains the fundamental split between understanding models (BERT) and generative models (GPT), and why each excels at different tasks.

Deep Dive

The process: take a text sequence, randomly select 15% of positions. For those positions: 80% are replaced with [MASK], 10% are replaced with a random token, 10% are kept unchanged. The model must predict the original token at each selected position. The 80/10/10 split prevents the model from learning to only pay attention to [MASK] tokens, which don't appear during actual use.
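To make the 80/10/10 split concrete, here is a minimal sketch of the corruption step in plain Python. The function name mask_tokens and the use of None to mark unselected positions are illustrative choices rather than any particular library's API; real implementations work on batched tensors and skip special tokens such as [CLS] and padding.

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style MLM corruption: select ~15% of positions, then within the
    selection replace 80% with [MASK], 10% with a random token, keep 10%."""
    inputs = list(token_ids)
    labels = [None] * len(token_ids)   # None = position not selected, no loss computed
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                                  # position not selected for prediction
        labels[i] = tok                               # model must recover the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_token_id                 # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels
```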

Bidirectional Context

The key advantage of MLM over causal LM: the model sees both left and right context when making predictions. For the sentence "The [MASK] sat on the mat," the model uses both "The" (left context) and "sat on the mat" (right context) to predict "cat." This bidirectional understanding is why BERT-style models produce richer representations than left-to-right models for understanding tasks.
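As a quick illustration of bidirectional prediction, the sketch below runs that sentence through a pretrained BERT model via the Hugging Face transformers fill-mask pipeline (assuming transformers is installed and bert-base-uncased can be downloaded); the top predictions typically include "cat" and "dog" precisely because the model conditions on both sides of the mask.

```python
from transformers import pipeline

# The fill-mask pipeline loads a BERT-style model and its tokenizer.
fill = pipeline("fill-mask", model="bert-base-uncased")

# The model attends to "The" on the left and "sat on the mat" on the right.
for pred in fill("The [MASK] sat on the mat."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```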

MLM vs. Causal LM

The trade-off: MLM creates excellent understanding (good for classification, search, NER) but can't generate text fluently (predicting masked tokens isn't the same as generating a sequence). Causal LM (predict the next token left-to-right) generates fluently but understands less deeply (it only sees left context). This split drove the encoder-vs-decoder divergence in NLP. Modern LLMs are almost all causal (decoder-only) because generation is more commercially valuable, but MLM-trained models remain the backbone of search and classification.
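One way to see the structural difference is in the attention patterns the two objectives imply. The tiny sketch below (PyTorch, purely illustrative; real models also mask padding) prints the two patterns: a lower-triangular mask for causal LM versus an all-ones mask for an MLM encoder.

```python
import torch

seq_len = 5

# Causal LM: position i attends only to positions 0..i (left context only).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int))

# MLM encoder: every position attends to every position (full bidirectional context).
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.int)

print("causal (decoder):\n", causal)
print("bidirectional (encoder):\n", bidirectional)
```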

Related Concepts
