Basics

Word Embedding

Word2Vec, GloVe, Word Vectors
Dense vector representations of words, where words with similar meanings get similar vectors. Word2Vec (2013) and GloVe (2014) pioneered the approach: trained on word co-occurrence patterns, they produce vectors in which "king − man + woman ≈ queen". Word embeddings are the forerunner of modern contextual embeddings (BERT, sentence-transformers) and remain foundational to understanding how neural networks represent language.

Why It Matters

Word embeddings were the breakthrough that made neural NLP practical. Before them, words were represented as one-hot vectors, which carry no notion of similarity. Word embeddings demonstrated that distributed representations can capture meaning, analogies, and semantic relationships. That insight, representing discrete symbols as learned continuous vectors, underlies every modern language model.
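
A toy comparison makes the contrast concrete. The sketch below uses plain NumPy with made-up numbers rather than real trained embeddings: every pair of distinct one-hot vectors has cosine similarity 0, while dense vectors give a graded notion of similarity.

```python
import numpy as np

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors: every pair of distinct words is equally (dis)similar.
vocab = ["king", "queen", "banana"]
one_hot = np.eye(len(vocab))
print(cos(one_hot[0], one_hot[1]))  # 0.0: "king" vs "queen"
print(cos(one_hot[0], one_hot[2]))  # 0.0: "king" vs "banana"

# Dense vectors: similarity becomes a meaningful, graded quantity.
# These numbers are invented for illustration, not real embeddings.
dense = {
    "king":   np.array([0.8, 0.6, 0.1]),
    "queen":  np.array([0.7, 0.7, 0.1]),
    "banana": np.array([0.1, 0.0, 0.9]),
}
print(cos(dense["king"], dense["queen"]))   # high
print(cos(dense["king"], dense["banana"]))  # low
```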

Deep Dive

Word2Vec (Mikolov et al., 2013, Google) trains by either predicting a word from its context (CBOW) or predicting the context from a word (Skip-gram). GloVe (Pennington et al., 2014, Stanford) factorizes the word co-occurrence matrix. Both produce similar results: 100- to 300-dimensional vectors whose cosine similarity correlates with semantic similarity. These vectors capture remarkable relationships: countries map to capitals, verbs map to their tenses, and analogies are solvable through vector arithmetic.
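
A quick way to see this in practice is with pretrained vectors. The sketch below assumes the gensim library and its downloadable "glove-wiki-gigaword-100" vectors; any pretrained word-vector set with a similar KeyedVectors interface would work.

```python
import gensim.downloader as api

# Downloads ~130 MB of pretrained 100-dimensional GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-100")

# Analogy via vector arithmetic: king - man + woman ≈ queen,
# scored by cosine similarity over the whole vocabulary.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity tracks semantic relatedness.
print(vectors.similarity("paris", "france"))   # relatively high
print(vectors.similarity("paris", "banana"))   # relatively low
```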

Static vs. Contextual

Word2Vec and GloVe produce one vector per word, regardless of context. "Bank" gets the same embedding whether it means "river bank" or "financial bank." Contextual embeddings (ELMo, then BERT) solved this by producing different representations depending on context. Modern sentence embeddings (from models like BGE, E5) go further, embedding entire sentences into vectors. Each generation improved on the last, but the core idea — meaning as a vector — started with Word2Vec.
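To see the difference, the hedged sketch below uses Hugging Face transformers with bert-base-uncased (an assumption; any contextual encoder would do) to embed the word "bank" in different contexts and compare the resulting vectors.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual vector BERT assigns to the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

river   = bank_vector("I sat on the river bank and watched the water.")
deposit = bank_vector("I deposited the check at the bank downtown.")
loan    = bank_vector("The bank approved my loan application.")

cos = torch.nn.functional.cosine_similarity
print(cos(deposit, loan, dim=0))   # same (financial) sense: higher similarity
print(cos(river, deposit, dim=0))  # different senses: lower similarity
```

A static embedding would return the identical vector for "bank" in all three sentences, so both comparisons would score 1.0.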

The Legacy

Word2Vec's biggest contribution wasn't the algorithm but the demonstration that neural networks can learn useful representations of language from raw text. This proof of concept inspired the progression from word vectors to sentence vectors to contextual embeddings to full language models. The embedding layer of every LLM is a direct descendant of word embeddings: a lookup table mapping discrete tokens to learned continuous vectors, just at a much larger scale.
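As a concrete picture of that lookup table, here is a minimal PyTorch sketch; the vocabulary size, dimension, and token ids are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 768          # illustrative sizes only
embedding = nn.Embedding(vocab_size, dim)  # a learnable lookup table of vectors

token_ids = torch.tensor([[101, 2023, 2003, 1037, 6251, 102]])  # hypothetical ids
vectors = embedding(token_ids)          # each id is mapped to its learned vector
print(vectors.shape)                    # torch.Size([1, 6, 768])
```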
