
Context Length Extension

YaRN, NTK Scaling, RoPE Scaling
Techniques that let a language model handle sequences longer than those it saw during training. A model trained on 4K tokens can be extended to 32K or 128K by modifying its position encoding (typically RoPE) plus a short fine-tuning run on longer sequences. This avoids the enormous cost of training on long sequences from scratch.

Why It Matters

Context length extension is why models went from 4K context windows to 128K to 1M+ in just two years. Training a model from scratch on million-token sequences would be prohibitively expensive. Extension techniques make long-context models practical by adapting models trained on shorter sequences, using only a fraction of the original training compute.

Deep Dive

The core challenge: RoPE (Rotary Position Embeddings) encodes position using rotation angles. At positions beyond the training length, these angles become extrapolations that the model has never seen, causing attention patterns to break down. Extension techniques modify how positions map to rotation angles so that longer sequences produce angles within the model's trained range.
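As a rough illustration (NumPy; the function name and scale factor are illustrative, not taken from any particular library), the sketch below computes RoPE rotation angles and shows how plain linear position interpolation squeezes an extended position range back into the angles the model was trained on:

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    # Each channel pair i rotates at frequency base**(-2i/dim); the angle at
    # position p is p * freq. scale > 1 implements plain linear position
    # interpolation: positions are squeezed so an extended range maps back
    # into the angle range seen during training.
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) per-pair frequencies
    return np.outer(positions / scale, freqs)       # (len(positions), dim/2) angles

train = rope_angles(np.arange(4096))                # angles seen during 4K training
extrap = rope_angles(np.arange(32768))              # naive 32K extrapolation
interp = rope_angles(np.arange(32768), scale=8.0)   # linear position interpolation

print(train.max(), extrap.max(), interp.max())
# ~4095, ~32767, ~4096: extrapolation pushes angles far past anything trained,
# interpolation keeps them (almost exactly) within the trained range.
```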

NTK-Aware Scaling

NTK-aware interpolation (Neural Tangent Kernel) adjusts RoPE frequencies non-uniformly: high-frequency components (important for local patterns) are preserved while low-frequency components (position-dependent) are interpolated. This preserves the model's ability to handle local patterns (word order, syntax) while extending its range for global position encoding. It's a one-line code change that dramatically improves length extrapolation.
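A minimal sketch of the idea (NumPy, illustrative function name): rather than rescaling positions, the non-uniform interpolation can be obtained by enlarging the RoPE base, which stretches low frequencies by roughly the full scale factor while leaving the highest frequencies nearly untouched.

```python
import numpy as np

def ntk_scaled_freqs(dim=128, base=10000.0, scale=8.0):
    # The "one-line change": enlarge the RoPE base instead of rescaling
    # positions. The exponent dim/(dim-2) is chosen so the lowest-frequency
    # pair ends up interpolated by the full scale factor while the
    # highest-frequency pair is left untouched.
    new_base = base * scale ** (dim / (dim - 2))
    return new_base ** (-np.arange(0, dim, 2) / dim)

orig = 10000.0 ** (-np.arange(0, 128, 2) / 128)
ntk = ntk_scaled_freqs()

print(ntk[0] / orig[0])     # 1.0  -> local (high-frequency) patterns preserved
print(orig[-1] / ntk[-1])   # 8.0  -> global (low-frequency) positions interpolated
```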

YaRN

YaRN (Yet another RoPE extensioN) combines NTK-aware interpolation with an attention temperature correction and a small amount of fine-tuning on extended-length data (typically a few hundred steps). This produces models that handle 4–8x their original context length with minimal quality degradation. Most open-source long-context models (like long-context Llama or Mistral variants) use YaRN or similar techniques. The fine-tuning step is crucial — scaling alone works somewhat, but fine-tuning at the target length significantly improves quality.
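The sketch below illustrates both pieces, a per-frequency blend between interpolation and extrapolation ("NTK-by-parts") and the attention temperature factor. The constants (alpha, beta, and the 0.1·ln(s) + 1 formula) follow commonly published YaRN defaults for Llama-style models; function names are illustrative rather than any library's API.

```python
import numpy as np

def yarn_freqs(dim=128, base=10000.0, scale=8.0, orig_ctx=4096,
               alpha=1.0, beta=32.0):
    # NTK-by-parts: blend each RoPE pair between full position interpolation
    # (frequency divided by `scale`) and no change, based on how many full
    # rotations it completes within the original context window.
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    wavelengths = 2 * np.pi / freqs
    rotations = orig_ctx / wavelengths          # full turns within trained context
    # ramp = 0 -> interpolate fully, ramp = 1 -> leave untouched (extrapolate)
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return freqs / scale * (1 - ramp) + freqs * ramp

def yarn_attention_scale(scale=8.0):
    # Attention temperature correction: queries/keys (or the cached cos/sin
    # tables) are multiplied by this factor, mildly sharpening the softmax
    # as the context grows.
    return 0.1 * np.log(scale) + 1.0

print(yarn_freqs()[:3])         # high-frequency pairs are left ~unchanged
print(yarn_attention_scale())   # ~1.21 for an 8x extension
```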
