Training

Masked Language Modeling

MLM, Masked LM, Cloze Task
A self-supervised training objective in which random tokens in the input are replaced with a [MASK] token, and the model must predict the original tokens from context. BERT popularized MLM: mask 15% of the tokens, use bidirectional attention to look at both left and right context, and predict the masked words. This produces powerful text-understanding models (as opposed to text-generation models).

Why it matters

MLM is the training objective that created BERT and the entire family of encoder models that still power most search, classification, and embedding systems in production. Understanding MLM vs. causal language modeling (next-token prediction) explains the fundamental split between understanding models (BERT) and generation models (GPT), and why each excels at different tasks.

Deep Dive

The process: take a text sequence and randomly select 15% of positions. For those positions, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are kept unchanged. The model must predict the original token at each selected position. The 80/10/10 split prevents the model from learning to attend only to [MASK] tokens, which never appear during actual use.
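A minimal sketch of the 80/10/10 masking rule in plain Python; the function name mask_tokens and parameters such as mask_token_id and vocab_size are illustrative, not taken from any particular library.

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size,
                mask_prob=0.15, seed=None):
    """Return (masked input, labels) for one sequence.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only over selected positions.
    """
    rng = random.Random(seed)
    input_ids = list(token_ids)
    labels = [None] * len(token_ids)

    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:      # ~85% of positions: untouched
            continue
        labels[i] = tok                    # model must predict the original
        roll = rng.random()
        if roll < 0.8:                     # 80% of selected: [MASK]
            input_ids[i] = mask_token_id
        elif roll < 0.9:                   # 10% of selected: random token
            input_ids[i] = rng.randrange(vocab_size)
        # remaining 10% of selected: keep the token unchanged
    return input_ids, labels
```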

Bidirectional Context

The key advantage of MLM over causal LM: the model sees both left and right context when making predictions. For the sentence "The [MASK] sat on the mat," the model uses both "The" (left context) and "sat on the mat" (right context) to predict "cat." This bidirectional understanding is why BERT-style models produce richer representations than left-to-right models for understanding tasks.
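To see bidirectional prediction in practice, a quick illustration using the Hugging Face transformers fill-mask pipeline; the checkpoint "bert-base-uncased" is just one example of a BERT-style model that works here.

```python
from transformers import pipeline

# The pipeline downloads the model and tokenizer on first use.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The [MASK] sat on the mat."):
    print(pred["token_str"], round(pred["score"], 3))
# Top candidates are ranked using both the left context ("The")
# and the right context ("sat on the mat").
```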

MLM vs. Causal LM

The trade-off: MLM creates excellent understanding (good for classification, search, NER) but can't generate text fluently, since predicting masked tokens isn't the same as generating a sequence. Causal LM (predict the next token left-to-right) generates fluently but understands less deeply, because it only sees left context. This split drove the encoder-vs-decoder divergence in NLP. Modern LLMs are almost all causal (decoder-only) because generation is more commercially valuable, but MLM-trained models remain the backbone of search and classification.
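One way to make the split concrete is to compare the attention masks the two objectives imply. A small PyTorch sketch with a toy sequence length (an assumption for illustration): an MLM encoder allows every position to attend everywhere, while a causal LM restricts position i to positions 0..i.

```python
import torch

seq_len = 5

# Bidirectional (MLM / encoder): every token attends to every token.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Causal (decoder): lower-triangular mask, no attention to the right.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
```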
