Basics

Embedding Layer

Token Embedding, Embedding Table, Lookup Table
A lookup table that maps every token in the vocabulary to a dense vector (that token's embedding). When the model receives token ID 42, the embedding layer returns row 42 of a learned matrix. This vector is the model's initial representation of that token: the starting point for everything the attention and feed-forward layers do afterward.
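A minimal sketch of this lookup using PyTorch's nn.Embedding; the sizes below are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

vocab_size, model_dim = 50_000, 512              # illustrative sizes
embedding = nn.Embedding(vocab_size, model_dim)  # learned matrix of shape (vocab_size, model_dim)

token_ids = torch.tensor([42, 7, 42])            # token IDs produced by a tokenizer
vectors = embedding(token_ids)                   # shape (3, model_dim); token 42 returns row 42 twice

# The lookup is literally row indexing into the weight matrix, no computation involved:
assert torch.equal(vectors[0], embedding.weight[42])
```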

Why It Matters

The embedding layer is where text becomes math. Every LLM begins by converting discrete tokens (words, subwords) into continuous vectors that a neural network can process. The embedding table is also one of the largest components of a small model: a 128K vocabulary with 4096-dimensional embeddings is roughly half a billion parameters (128,000 × 4,096 ≈ 524M). Understanding this helps you reason about model size and vocabulary design.
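A quick back-of-the-envelope check of that parameter count, using the figures from the paragraph above:

```python
vocab_size = 128_000   # 128K-token vocabulary
model_dim = 4_096      # embedding dimension

embedding_params = vocab_size * model_dim
print(f"{embedding_params:,}")   # 524,288,000: over half a billion parameters in the table alone
```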

Deep Dive

The embedding layer is just a matrix E of shape (vocab_size, model_dim). For token ID i, the embedding is E[i] — a simple row lookup, no computation. But these embeddings are learned during training: tokens that appear in similar contexts get similar embeddings. The classic example: the embeddings for "king" − "man" + "woman" ≈ "queen," showing that the embedding space captures semantic relationships.
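A sketch of the analogy test with NumPy. The matrix E and the token_id mapping are placeholders; in practice you would load trained word vectors (e.g. word2vec or GloVe) instead of random values:

```python
import numpy as np

# Placeholder embedding matrix and vocabulary; real word vectors would be loaded from disk.
rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 300)).astype(np.float32)   # (vocab_size, model_dim)
token_id = {"king": 11, "man": 12, "woman": 13, "queen": 14}

def nearest(query, E, k=5):
    """Indices of the k rows of E closest to `query` by cosine similarity."""
    sims = (E @ query) / (np.linalg.norm(E, axis=1) * np.linalg.norm(query) + 1e-9)
    return np.argsort(-sims)[:k]

query = E[token_id["king"]] - E[token_id["man"]] + E[token_id["woman"]]
print(nearest(query, E))   # with trained embeddings, token_id["queen"] would rank near the top
```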

Tied Embeddings

Many models share (tie) the embedding matrix with the output layer (the "unembedding" or "language model head"). The output layer converts hidden states back into vocabulary probabilities by computing a dot product with each token's embedding. Tying these layers means the same embedding both represents a token on input and predicts it on output, saving parameters and often improving quality. Tying is most common in smaller models, where the embedding table is a large share of the total parameter count; many larger LLMs keep the input and output matrices separate.
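A minimal sketch of weight tying in PyTorch; TinyLM is a hypothetical toy model, not any specific LLM's implementation:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, model_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, model_dim)
        self.lm_head = nn.Linear(model_dim, vocab_size, bias=False)
        # Tie the weights: the output projection reuses the input embedding matrix,
        # so both layers share the same (vocab_size, model_dim) parameters.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)   # (batch, seq, model_dim); transformer blocks would go here
        return self.lm_head(h)      # logits over the vocabulary, (batch, seq, vocab_size)

model = TinyLM(vocab_size=1000, model_dim=64)
logits = model(torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```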

Positional + Token Embeddings

The full input representation is typically: token_embedding + positional_encoding. The token embedding captures what the token means. The positional encoding captures where it appears in the sequence. In models with learned position embeddings (BERT), this is a second embedding table indexed by position. In models with RoPE (LLaMA), positional information is injected differently (by rotating Q and K vectors), and the embedding layer only handles token identity.
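A sketch of the BERT-style combination with two learned tables (sizes are illustrative); a RoPE model such as LLaMA would drop pos_embed here and rotate Q and K inside attention instead:

```python
import torch
import torch.nn as nn

vocab_size, max_len, model_dim = 30_000, 512, 768  # illustrative, roughly BERT-base-like sizes

tok_embed = nn.Embedding(vocab_size, model_dim)   # indexed by token ID: what the token means
pos_embed = nn.Embedding(max_len, model_dim)      # indexed by position: where it appears

token_ids = torch.randint(0, vocab_size, (1, 16))          # (batch=1, seq_len=16)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, ..., 15]]

x = tok_embed(token_ids) + pos_embed(positions)   # full input representation, (1, 16, model_dim)
print(x.shape)
```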

Related Concepts

Embedding Emergence