Basics

Positional Encoding

Positional Embedding, RoPE, ALiBi
A mechanism that tells a Transformer model the order of tokens in a sequence. Unlike an RNN, which processes tokens one after another (so position is implicit), a Transformer processes all tokens in parallel and has no built-in sense of order. Positional encoding injects position information so the model can tell that "dog bites man" and "man bites dog" are different.

Why It Matters

Without positional information, a Transformer treats a sentence as a bag of words: word order is lost. The choice of positional encoding also determines how well a model handles sequences longer than those seen during training, which is why techniques like RoPE and ALiBi are central to long-context models.

Deep Dive

The original Transformer (2017) used fixed sinusoidal functions at different frequencies for each position and dimension. These had a nice theoretical property: the model could learn to attend to relative positions because the sinusoidal patterns create consistent offsets. But learned positional embeddings (a trainable vector for each position) quickly became the default because they performed slightly better, despite being limited to the maximum training length.
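
As an illustration of the sinusoidal scheme described above, here is a minimal NumPy sketch (function name and shapes are illustrative, not from any particular library; assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))  # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```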

RoPE: The Modern Standard

Rotary Position Embeddings (RoPE, Su et al., 2021) encode position by rotating the query and key vectors in the attention mechanism. The angle of rotation depends on position, so the dot product between two tokens naturally encodes their relative distance. RoPE is used by LLaMA, Mistral, Qwen, and most modern LLMs. Its key advantage: it enables length extrapolation — models can handle sequences somewhat longer than those seen during training, especially when combined with techniques like YaRN or NTK-aware scaling.
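
A minimal NumPy sketch of the rotation, assuming an even head dimension and the interleaved dimension pairing from the original paper (some implementations, e.g. LLaMA, pair the first and second halves of the vector instead):

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) pair of dimensions by a position-dependent angle.

    x: (seq_len, head_dim) query or key vectors, head_dim even.
    positions: (seq_len,) integer token positions.
    """
    head_dim = x.shape[-1]
    # One frequency per pair of dimensions, decreasing geometrically.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, head_dim/2)

    x1, x2 = x[:, 0::2], x[:, 1::2]                   # paired dimensions
    cos, sin = np.cos(angles), np.sin(angles)

    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the dot product, so
# apply_rope(q, m) @ apply_rope(k, n) depends on the relative offset m - n.
```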

ALiBi and Beyond

ALiBi (Attention with Linear Biases) takes a simpler approach: instead of modifying embeddings, it adds a linear penalty to attention scores based on distance between tokens. Farther tokens get penalized more. This requires no learned parameters and extrapolates well to longer sequences. Some architectures combine approaches or use relative position biases. The trend is toward methods that generalize beyond the training length, since context windows keep growing.
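
A sketch of the ALiBi bias matrix, assuming a causal model and a power-of-two head count so the geometric slope schedule from the paper applies directly (the array shapes and function name are illustrative):

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Linear distance penalties added to attention scores, one slope per head."""
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

    # distance[i, j] = j - i, which is <= 0 for causal (past) keys.
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]            # (seq_len, seq_len)

    # Farther keys get a more negative bias; the current token gets 0.
    return slopes[:, None, None] * distance[None, :, :]  # (num_heads, seq_len, seq_len)

# scores = q @ k.T / np.sqrt(head_dim) + alibi_bias(seq_len, num_heads)[h]
# (the usual causal mask is still applied on top for positions j > i)
```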
