The process: take a text sequence and randomly select 15% of its token positions. Of the selected positions, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The model must predict the original token at each selected position. The 80/10/10 split keeps the model from relying on [MASK] as its only cue that a prediction is needed, since [MASK] never appears in the inputs the model sees during fine-tuning or actual use.
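The masking procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular library: the function name `mask_tokens`, the `vocab` list, and the `seed` parameter are hypothetical, and the sketch assumes each position is selected independently with probability 0.15, so the selected fraction is 15% in expectation rather than exactly.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Corrupt a token list for MLM training.

    Returns (corrupted, labels). labels[i] holds the original token at
    selected positions and None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Assumption: independent selection per position (15% in expectation).
        if rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must recover this original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens(tokens, vocab, seed=0)
```

Only positions with a non-None label contribute to the training loss. Because some of those positions still show a real token, the model cannot assume that every visible token is correct.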
The key advantage of MLM over causal LM: the model sees both left and right context when making predictions. For the sentence "The [MASK] sat on the mat," the model uses both "The" (left context) and "sat on the mat" (right context) to predict "cat." This bidirectional context is a main reason BERT-style models tend to produce richer representations than left-to-right models on understanding tasks.
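One way to picture the difference is through the attention mask each model type uses. The sketch below is illustrative only: `attention_mask` and `visible_context` are hypothetical helper names, and it assumes the usual convention that a causal model lets position i attend to positions up to and including i, while a bidirectional model lets every position attend to every other.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def visible_context(tokens, i, causal):
    """Tokens other than position i itself that position i can attend to."""
    mask = attention_mask(len(tokens), causal)
    return [t for j, t in enumerate(tokens) if mask[i][j] and j != i]

tokens = ["The", "[MASK]", "sat", "on", "the", "mat"]
slot = tokens.index("[MASK]")

print(visible_context(tokens, slot, causal=False))
# ['The', 'sat', 'on', 'the', 'mat']
print(visible_context(tokens, slot, causal=True))
# ['The']
```

With the bidirectional mask, the prediction for the masked slot can draw on "sat on the mat"; with the causal mask, only "The" is available.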
The trade-off: MLM produces strong representations for understanding tasks (classification, search, NER) but is poorly suited to fluent text generation, because filling in masked tokens is not the same as producing a sequence token by token. Causal LM (predicting the next token left to right) generates fluently, but each position sees only its left context, so its representations are less suited to understanding tasks. This split drove the encoder-vs-decoder divergence in NLP. Most modern LLMs are causal (decoder-only), largely because generation is the more commercially valuable capability, while MLM-trained encoders remain widely used for search and classification.