
Masked Language Modeling

MLM, Masked LM, Cloze Task
A self-supervised training objective in which random tokens in the input are replaced with a [MASK] token and the model must predict the original tokens from context. BERT popularized MLM: mask 15% of the tokens, use bidirectional attention to look at both left and right context, and predict the masked words. This produces powerful text-understanding models (as opposed to text-generation models).

Why it matters

MLM is the training objective that produced BERT and the entire family of encoder models that still power most search, classification, and embedding systems in production. Understanding MLM vs. causal language modeling (next-token prediction) explains the fundamental split between understanding models (BERT) and generation models (GPT), and why each excels at different tasks.

Deep Dive

The process: take a text sequence and randomly select 15% of positions. Of those positions, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are kept unchanged. The model must predict the original token at each selected position. The 80/10/10 split prevents the model from learning to attend only to [MASK] tokens, which never appear during actual use.
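A minimal sketch of that corruption step in PyTorch (the function name `mask_tokens` and its arguments are illustrative, and special tokens such as [CLS]/[SEP] are ignored here for brevity; this is not BERT's exact preprocessing code):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption: select ~15% of positions, then apply the 80/10/10 rule."""
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # non-selected positions are ignored by the loss

    # 80% of selected positions -> replaced with [MASK].
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% -> replaced with a random token.
    randomized = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    )
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # The last 10% are kept unchanged but must still be predicted.
    return input_ids, labels
```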

Bidirectional Context

The key advantage of MLM over causal LM: the model sees both left and right context when making predictions. For the sentence "The [MASK] sat on the mat," the model uses both "The" (left context) and "sat on the mat" (right context) to predict "cat." This bidirectional understanding is why BERT-style models produce richer representations than left-to-right models for understanding tasks.
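As a quick illustration, a pretrained BERT can fill that blank through the Hugging Face `fill-mask` pipeline (a hedged sketch; the exact scores and ranking depend on the model checkpoint):

```python
from transformers import pipeline

# Load a pretrained BERT for masked-token prediction.
fill = pipeline("fill-mask", model="bert-base-uncased")

# The model attends to both the left and right context around [MASK].
for pred in fill("The [MASK] sat on the mat."):
    print(pred["token_str"], round(pred["score"], 3))
# Words like "cat" or "dog" typically rank highly thanks to the right-hand context.
```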

MLM vs. Causal LM

The trade-off: MLM creates excellent understanding (good for classification, search, NER) but can't generate text fluently (predicting masked tokens isn't the same as generating a sequence). Causal LM (predict the next token left-to-right) generates fluently but understands less deeply (only sees left context). This split drove the encoder-vs-decoder divergence in NLP. Modern LLMs are all causal (decoder-only) because generation is more commercially valuable, but MLM-trained models remain the backbone of search and classification.
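The difference shows up directly in how the loss is computed. A sketch of both objectives with cross-entropy, using placeholder tensors in place of real model outputs (the -100 ignore-index convention follows common PyTorch practice):

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 30522, 2, 8
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model outputs
input_ids = torch.randint(vocab_size, (batch, seq_len))   # stand-in for a tokenized batch

# MLM objective: loss is computed only at the corrupted positions;
# all other labels are -100, which cross_entropy ignores by default.
# The encoder saw the full sentence bidirectionally when producing these logits.
mlm_labels = torch.full_like(input_ids, -100)
mlm_labels[:, 2] = input_ids[:, 2]  # pretend position 2 was masked
mlm_loss = F.cross_entropy(logits.view(-1, vocab_size), mlm_labels.view(-1))

# Causal objective: predict token t+1 from tokens 0..t, so logits are shifted
# one step against the labels and every position contributes to the loss.
# The decoder only ever saw the left context when producing these logits.
causal_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    input_ids[:, 1:].reshape(-1),
)
print(mlm_loss.item(), causal_loss.item())
```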

Related concepts
