Fundamentals

Positional Encoding

Positional Embedding, RoPE, ALiBi
A mechanism that tells a Transformer model the order of tokens in a sequence. Unlike RNNs, which process tokens sequentially (so position is implicit), Transformers process all tokens in parallel and have no inherent sense of order. Positional encodings inject position information so the model knows that "dog bites man" and "man bites dog" are different.

Why it matters

Without positional information, a Transformer treats a sentence as a bag of words: word order is lost. The choice of positional encoding also determines how well a model handles sequences longer than those seen during training, which is why techniques like RoPE and ALiBi are critical for long-context models.

Deep Dive

The original Transformer (2017) used fixed sinusoidal functions at different frequencies for each position and dimension. These had a nice theoretical property: the model could learn to attend to relative positions because the sinusoidal patterns create consistent offsets. But learned positional embeddings (a trainable vector for each position) quickly became the default because they performed slightly better, despite being limited to the maximum training length.
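A minimal NumPy sketch of the sinusoidal scheme (the function name and shapes are illustrative, not from any particular library): each position gets a unique vector built from sines and cosines at geometrically spaced frequencies, and that vector is simply added to the token embedding.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Fixed sinusoidal encodings as in the original Transformer paper.

    Even dimensions use sin, odd dimensions use cos, with each pair
    oscillating at a different frequency so every position gets a
    distinct pattern.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / (base ** (dims / d_model))   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Usage: the encoding is added to the token embeddings before the first layer,
# e.g. x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```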

RoPE: The Modern Standard

Rotary Position Embeddings (RoPE, Su et al., 2021) encode position by rotating the query and key vectors in the attention mechanism. The angle of rotation depends on position, so the dot product between two tokens naturally encodes their relative distance. RoPE is used by LLaMA, Mistral, Qwen, and most modern LLMs. Its key advantage: it enables length extrapolation — models can handle sequences somewhat longer than those seen during training, especially when combined with techniques like YaRN or NTK-aware scaling.
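A minimal sketch of the rotation idea, assuming a single attention head with an even head dimension (the interleaved-pair variant; names and shapes are illustrative): consecutive dimension pairs of a query or key vector are rotated by an angle proportional to the token's position, so the query-key dot product ends up depending only on the relative offset.

```python
import numpy as np

def rope(x, base=10000.0):
    """Minimal sketch of Rotary Position Embeddings.

    x has shape (seq_len, head_dim) with even head_dim. Each
    (even, odd) dimension pair is rotated by position * frequency,
    with frequencies decreasing across dimension pairs.
    """
    seq_len, head_dim = x.shape
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim/2,)
    angles = positions * freqs                                 # (seq_len, head_dim/2)

    x1, x2 = x[:, 0::2], x[:, 1::2]            # split into (even, odd) pairs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # standard 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys before the attention dot product:
# scores = rope(q) @ rope(k).T  # scores now encode relative positions
```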

ALiBi and Beyond

ALiBi (Attention with Linear Biases) takes a simpler approach: instead of modifying embeddings, it adds a linear penalty to attention scores based on distance between tokens. Farther tokens get penalized more. This requires no learned parameters and extrapolates well to longer sequences. Some architectures combine approaches or use relative position biases. The trend is toward methods that generalize beyond the training length, since context windows keep growing.
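A minimal sketch of ALiBi's distance penalty (function name and slope choice follow the paper's geometric sequence, which assumes a power-of-two head count; everything else is illustrative): each head gets a fixed negative bias proportional to how far back a key token is, added straight to the attention logits.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Minimal sketch of ALiBi's linear distance penalty (no learned parameters).

    The bias for head h between query position i and key position j
    is slope[h] * (j - i), which is increasingly negative for tokens
    farther in the past under causal attention.
    """
    # Per-head slopes: 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

    # distance[i, j] = j - i (negative for past tokens)
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]

    # Shape (num_heads, seq_len, seq_len), added directly to attention logits
    return slopes[:, None, None] * distance[None, :, :]

# Usage: attention_logits = q @ k.T / np.sqrt(d) + alibi_bias(seq_len, num_heads)
```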

Related concepts
