Zubnet AILearnWiki › Embedding Layer
Fundamentals

Embedding Layer

Token Embedding, Embedding Table, Lookup Table
A lookup table that maps each token in the vocabulary to a dense vector (the token's embedding). When the model receives token ID 42, the embedding layer returns row 42 of a learned matrix. This vector is the model's initial representation of that token — the starting point for all subsequent processing through attention and feedforward layers.

Why it matters

The embedding layer is where text becomes math. Every LLM starts by converting discrete tokens (words, subwords) into continuous vectors that the neural network can process. The embedding table is also one of the largest components of small models: a 128K-token vocabulary with 4096-dimensional embeddings is roughly 537 million parameters (131,072 × 4,096 = 2^29, about half a billion). Understanding this helps you reason about model sizes and vocabulary design.
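That parameter count is just the product of the table's two dimensions. A quick check, reading "128K" as 131,072 (2^17):

```python
# Embedding table: one row per vocabulary token, one column per model dimension.
vocab_size = 131_072  # a "128K" vocabulary (2**17)
model_dim = 4_096     # embedding dimension (2**12)

params = vocab_size * model_dim
print(f"{params:,}")  # 536,870,912 (about half a billion parameters)
```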

Deep Dive

The embedding layer is just a matrix E of shape (vocab_size, model_dim). For token ID i, the embedding is E[i] — a simple row lookup, no computation. But these embeddings are learned during training: tokens that appear in similar contexts get similar embeddings. The classic example: the embeddings for "king" − "man" + "woman" ≈ "queen," showing that the embedding space captures semantic relationships.
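As a concrete sketch (the framework and sizes here are illustrative, not from the article), PyTorch's `nn.Embedding` is exactly this matrix plus row lookup:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
vocab_size, model_dim = 50_000, 512

# The embedding layer is a learned matrix of shape (vocab_size, model_dim).
embedding = nn.Embedding(vocab_size, model_dim)

# Token ID 42 -> row 42 of the matrix. No matmul, just indexing.
token_ids = torch.tensor([42, 7, 42])
vectors = embedding(token_ids)       # shape (3, 512)

# Equivalent direct row lookup on the underlying matrix:
same = embedding.weight[token_ids]
print(torch.equal(vectors, same))    # True
```

Note that repeated IDs (42 appears twice) simply fetch the same row twice; the layer has no notion of context or position.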

Tied Embeddings

Many models share (tie) the embedding matrix with the output layer (the "unembedding" or "language model head"). The output layer converts hidden states back into vocabulary probabilities by computing a dot product with each token's embedding. Tying these layers means the same vector both represents a token on input and predicts it on output, saving parameters and often improving quality. Tying is most common in smaller models, where the embedding table is a large fraction of total parameters; many larger LLMs keep the two matrices separate.
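A minimal sketch of weight tying in PyTorch (sizes are hypothetical): the output head is made to reuse the embedding table as its weight, so the logit for token i is the dot product of the hidden state with embedding row i.

```python
import torch
import torch.nn as nn

vocab_size, model_dim = 50_000, 512

embedding = nn.Embedding(vocab_size, model_dim)        # input: token ID -> vector
lm_head = nn.Linear(model_dim, vocab_size, bias=False) # output: vector -> logits

# Tie the weights: nn.Linear stores its weight as (out_features, in_features),
# i.e. (vocab_size, model_dim), the same shape as the embedding table.
lm_head.weight = embedding.weight

hidden = torch.randn(1, model_dim)   # a hidden state from the final layer
logits = lm_head(hidden)             # logit for token i = hidden . embedding[i]
print(logits.shape)                  # torch.Size([1, 50000])
```

Because the two modules now hold the same tensor, gradients from both the input and output sides update a single matrix.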

Positional + Token Embeddings

The full input representation is typically: token_embedding + positional_encoding. The token embedding captures what the token means; the positional encoding captures where it appears in the sequence. In models with learned position embeddings (e.g., BERT), this is a second embedding table indexed by position. In models with RoPE (e.g., LLaMA), positional information is injected differently (by rotating the query and key vectors inside attention), and the embedding layer handles only token identity.
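A sketch of the BERT-style case, with a second learned table indexed by position (all sizes and token IDs here are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative sizes: vocabulary, maximum sequence length, embedding dimension.
vocab_size, max_len, model_dim = 30_000, 512, 768

token_emb = nn.Embedding(vocab_size, model_dim)  # indexed by token ID
pos_emb = nn.Embedding(max_len, model_dim)       # indexed by position 0..max_len-1

token_ids = torch.tensor([[101, 2054, 2003, 102]])        # (batch=1, seq=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# Full input representation: what each token means + where it sits.
x = token_emb(token_ids) + pos_emb(positions)    # shape (1, 4, 768)
print(x.shape)
```

In a RoPE model only the `token_emb` lookup would appear here; the positional rotation happens later, inside each attention layer.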

Related Concepts

Embedding Emergence