Embedding Layer: Definition & Meaning — AI Wiki

A lookup table that maps each token in the vocabulary to a dense vector (the token's embedding). When the model receives token ID 42, the embedding layer returns row 42 of a learned matrix. This vector is the model's initial representation of that token: the starting point for all subsequent processing through the attention and feedforward layers.

Why It Matters

The embedding layer is where text becomes math. Every LLM starts by converting discrete tokens (words, subwords) into continuous vectors that the neural network can process. The embedding table is also one of the largest components of small models: a 128K vocabulary with 4096-dimensional embeddings is roughly half a billion parameters (128,000 × 4,096 ≈ 524 million). Understanding this helps you reason about model sizes and vocabulary design.
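The parameter count above is just the product of the two matrix dimensions. A minimal sketch of that arithmetic follows; the helper name `embedding_params` is hypothetical, and "128K" is evaluated under both its decimal and binary readings since the text does not say which is meant.

```python
def embedding_params(vocab_size: int, model_dim: int) -> int:
    """Number of parameters in a (vocab_size, model_dim) embedding table."""
    return vocab_size * model_dim

# "128K" read as 128,000 tokens, and as 2**17 = 131,072 tokens.
decimal_count = embedding_params(128_000, 4096)
binary_count = embedding_params(2**17, 4096)

print(f"{decimal_count:,}")  # 524,288,000
print(f"{binary_count:,}")   # 536,870,912
```

Either way the table holds roughly half a billion weights, which is a large share of a model with only a few billion parameters in total.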

Deep Dive

The embedding layer is a matrix E of shape (vocab_size, model_dim). For token ID i, the embedding is E[i]: a row lookup with no arithmetic, equivalent to multiplying a one-hot vector by E. The rows are learned during training, so tokens that appear in similar contexts end up with similar embeddings. The classic illustration comes from word2vec-style word vectors: "king" − "man" + "woman" ≈ "queen", showing that the embedding space can capture semantic relationships.
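The row lookup can be sketched in a few lines. This is an illustrative toy in plain Python, not how a tensor library implements it: the sizes are arbitrary, the random values stand in for learned weights, and the `embed` helper is a hypothetical name.

```python
import random

vocab_size, model_dim = 100, 8  # toy sizes, chosen only for illustration

# E has shape (vocab_size, model_dim); random values stand in for learned weights.
rng = random.Random(0)
E = [[rng.gauss(0.0, 0.02) for _ in range(model_dim)] for _ in range(vocab_size)]

def embed(token_ids):
    """Return one embedding row per token ID (a pure lookup, no arithmetic)."""
    return [E[i] for i in token_ids]

token_ids = [42, 7, 42]
vectors = embed(token_ids)

print(len(vectors), len(vectors[0]))  # 3 8
print(vectors[0] is E[42])            # True: the lookup returns row 42 itself
print(vectors[0] == vectors[2])       # True: the same token always gets the same vector
```

Note that the same token ID maps to the same vector wherever it appears; any dependence on context or position has to come from later parts of the model.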

Tied Embeddings

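Many models share (tie) the embedding matrix with the output layer (the "unembedding" or "language model head"). The output layer converts hidden states back into vocabulary scores by computing a dot product with each token's vector, and a softmax turns those scores into probabilities. Tying the two layers means the same vector both represents a token on input and scores it on output, which saves vocab_size × model_dim parameters and can improve quality. Tying is common in smaller models, where the embedding table is a large share of the total parameters (GPT-2 ties its embeddings, for example), while many larger models such as LLaMA keep separate input and output matrices.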
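A minimal sketch of weight tying, under simplifying assumptions: toy sizes, random values in place of learned weights, and no transformer layers at all, so the hidden state is just the input embedding. The function names `embed`, `unembed`, and `softmax` are hypothetical.

```python
import math
import random

vocab_size, model_dim = 5, 4  # toy sizes for illustration

rng = random.Random(0)
# A single matrix E serves as both the input embedding and the output head.
E = [[rng.gauss(0.0, 1.0) for _ in range(model_dim)] for _ in range(vocab_size)]

def embed(token_id):
    """Input side: look up the token's row of E."""
    return E[token_id]

def unembed(hidden):
    """Output side: one logit per vocabulary entry, each a dot product with a row of E."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

def softmax(logits):
    """Turn logits into probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

# Stand-in for the transformer stack: the hidden state is just the input embedding.
hidden = embed(3)
probs = softmax(unembed(hidden))

untied_params = 2 * vocab_size * model_dim  # separate input and output matrices
tied_params = vocab_size * model_dim        # one shared matrix
print(len(probs))                   # 5
print(untied_params - tied_params)  # 20
```

The saving equals the size of one full embedding table, which is why tying matters most when that table is a large fraction of the model.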

Positional + Token Embeddings

In models that use absolute position information, the full input representation is token_embedding + positional_encoding. The token embedding captures what the token means; the positional encoding captures where it appears in the sequence. In models with learned position embeddings (BERT), the positional term comes from a second embedding table indexed by position. In models with RoPE (LLaMA), positional information is injected differently, by rotating the Q and K vectors inside attention, so the embedding layer only handles token identity.
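The learned-position case can be sketched as two lookups and an element-wise sum. This is a toy illustration only: the table sizes are arbitrary, random values stand in for learned weights, and `random_table` and `input_representation` are hypothetical names. Real BERT-style models add further steps (such as normalization) that are omitted here.

```python
import random

vocab_size, max_positions, model_dim = 50, 16, 4  # toy sizes

rng = random.Random(0)

def random_table(rows, cols):
    """Random values standing in for a learned embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(vocab_size, model_dim)        # indexed by token ID
position_table = random_table(max_positions, model_dim)  # indexed by position

def input_representation(token_ids):
    """Sum the token embedding and the position embedding at each position."""
    if len(token_ids) > max_positions:
        raise ValueError("sequence longer than the position table")
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

x = input_representation([7, 7, 3])
print(len(x), len(x[0]))  # 3 4
print(x[0] != x[1])       # True: same token, different positions
```

The length check reflects a practical limit of this design: a learned position table has no row for positions beyond `max_positions`.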
