Training

Masked Language Modeling

MLM, Masked LM, Cloze Task
A self-supervised training objective in which random tokens in the input are replaced with a [MASK] token, and the model must predict the original tokens from context. BERT popularized MLM: mask 15% of tokens, use bidirectional attention to see both left and right context, and predict the masked words. This produces powerful text-understanding models (as opposed to text-generation models).

Why It Matters

MLM is the training objective behind BERT and the whole family of encoder models that still powers most production search, classification, and embedding systems. Understanding MLM vs. causal language modeling (next-token prediction) explains the fundamental split between understanding models (BERT) and generation models (GPT), and why each excels at different tasks.

Deep Dive

The process: take a text sequence, randomly select 15% of positions. For those positions: 80% are replaced with [MASK], 10% are replaced with a random token, 10% are kept unchanged. The model must predict the original token at each selected position. The 80/10/10 split prevents the model from learning to only pay attention to [MASK] tokens, which don't appear during actual use.
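The 80/10/10 corruption rule above can be sketched in a few lines. This is a minimal illustration, not BERT's actual implementation; the token list, vocabulary, and function name are invented for the example.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """BERT-style MLM corruption: select ~15% of positions, then
    80% -> [MASK], 10% -> random token, 10% -> kept unchanged.
    Returns (corrupted tokens, labels); labels hold the original token
    at each selected position and None everywhere else."""
    rng = rng or random.Random(0)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                          # position not selected
        labels[i] = tok                       # model is trained to predict this
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # else: 10% of selected positions keep the original token
    return corrupted, labels

sentence = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = mask_tokens(sentence * 10, rng=random.Random(42))
```

Note the loss is computed only at positions where `labels` is not None, which is what prevents the model from simply copying unselected tokens.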

Bidirectional Context

The key advantage of MLM over causal LM: the model sees both left and right context when making predictions. For the sentence "The [MASK] sat on the mat," the model uses both "The" (left context) and "sat on the mat" (right context) to predict "cat." This bidirectional understanding is why BERT-style models produce richer representations than left-to-right models for understanding tasks.
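The difference between bidirectional and left-to-right context comes down to the attention mask. A small sketch with NumPy (boolean masks where True means "may attend"; the example sentence and helper names are made up for illustration):

```python
import numpy as np

def bidirectional_mask(n):
    """Encoder-style (MLM) attention: every position attends to all positions."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    """Decoder-style (causal LM) attention: position i attends only to j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

tokens = ["The", "[MASK]", "sat", "on", "the", "mat"]
n = len(tokens)
bi, ca = bidirectional_mask(n), causal_mask(n)

# Position 1 (the [MASK] slot) under each scheme:
# - bidirectional: sees all 6 tokens, including the right context "sat on the mat"
# - causal: sees only "The" and itself
print(bi[1].sum())  # 6
print(ca[1].sum())  # 2
```

The extra right-context visibility in row 1 is exactly the information a BERT-style model uses to resolve "[MASK]" to "cat."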

MLM vs. Causal LM

The trade-off: MLM creates excellent understanding (good for classification, search, NER) but can't generate text fluently (predicting masked tokens isn't the same as generating a sequence). Causal LM (predict the next token left-to-right) generates fluently but understands less deeply (it only sees left context). This split drove the encoder-vs-decoder divergence in NLP. Modern LLMs are almost all causal (decoder-only) because generation is more commercially valuable, but MLM-trained models remain the backbone of search and classification.
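The two objectives also differ in where the loss is computed: MLM averages cross-entropy over only the masked positions, while causal LM averages it over (nearly) every position, each predicting the next token. A toy sketch with random logits standing in for a model's output (vocabulary, labels, and function names are invented for the example):

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single position (numerically stable)."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, len(vocab)))  # 4 positions, one logit row each

# MLM: loss only at masked positions (here, only position 1 was masked).
mlm_labels = [None, vocab["cat"], None, None]
mlm_terms = [cross_entropy(logits[i], t)
             for i, t in enumerate(mlm_labels) if t is not None]
mlm_loss = sum(mlm_terms) / len(mlm_terms)

# Causal LM: every position (except the last) predicts the NEXT token.
next_tokens = [vocab["cat"], vocab["sat"], vocab["the"], None]
clm_terms = [cross_entropy(logits[i], t)
             for i, t in enumerate(next_tokens) if t is not None]
clm_loss = sum(clm_terms) / len(clm_terms)
```

The causal objective gets a training signal from almost every token per sequence, while MLM gets one from only ~15% of tokens, which is one reason MLM training is often described as less sample-efficient per pass over the data.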
