Infrastructure

Context Length Extension

YaRN, NTK Scaling, RoPE Scaling
Techniques for letting a language model handle sequences longer than any it saw during training. A model trained on 4K tokens can be extended to 32K or 128K by modifying its positional encoding (usually RoPE) plus a short fine-tune on longer sequences. This avoids the enormous cost of training on long sequences from scratch.

Why It Matters

Context length extension is why models went from 4K context windows to 128K and then 1M+ in only two years. Training a model from scratch on million-token sequences would be prohibitively expensive. Extension techniques make long-context models practical by adapting models trained on shorter sequences, using only a small fraction of the original training compute.

Deep Dive

The core challenge: RoPE (Rotary Position Embeddings) encodes position using rotation angles. At positions beyond the training length, these angles become extrapolations that the model has never seen, causing attention patterns to break down. Extension techniques modify how positions map to rotation angles so that longer sequences produce angles within the model's trained range.
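
The sketch below (not from the source; all names and values are illustrative) computes RoPE rotation angles and applies the simplest remapping, linear position interpolation: position indices are divided by a scale factor so that a sequence several times longer than the training length still produces angles in the trained range.

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to each (position, frequency) pair.

    scale > 1 applies linear position interpolation: positions are
    compressed so a sequence `scale` times longer than the training
    length maps back into the angle range seen during training.
    """
    # One frequency per pair of head dimensions: base^(-2i/dim).
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    pos = np.asarray(positions, dtype=np.float64) / scale  # the remapping
    return np.outer(pos, inv_freq)  # shape (len(positions), dim // 2)

# Model trained on 4K tokens, target 32K -> scale factor 8.
angles_raw = rope_angles([32768])             # angles the model never saw
angles_pi  = rope_angles([32768], scale=8.0)  # same range as position 4096
```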

NTK-Aware Scaling

NTK-aware interpolation (Neural Tangent Kernel) adjusts RoPE frequencies non-uniformly: high-frequency components (important for local patterns) are preserved while low-frequency components (which carry long-range position information) are interpolated. This preserves the model's ability to handle local patterns (word order, syntax) while extending its range for global position encoding. It's a one-line code change that dramatically improves length extrapolation.
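
A minimal sketch of that one-line change, assuming the commonly used formulation in which NTK-aware scaling is expressed as raising the RoPE base (the function name and defaults here are illustrative):

```python
def ntk_aware_base(base=10000.0, dim=128, scale=8.0):
    """NTK-aware scaling: rather than compressing positions uniformly,
    raise the RoPE base. This stretches the low-frequency dimensions
    (global position) while leaving the highest frequencies (local
    word-order information) almost unchanged."""
    return base * scale ** (dim / (dim - 2))

# Feed the adjusted base into the usual RoPE frequency computation,
# e.g. inv_freq = 1 / new_base ** (arange(0, dim, 2) / dim).
```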

YaRN

YaRN (Yet another RoPE extensioN) combines NTK-aware interpolation with an attention temperature correction and a small amount of fine-tuning on extended-length data (typically a few hundred steps). This produces models that handle 4–8x their original context length with minimal quality degradation. Most open-source long-context models (like long-context Llama or Mistral variants) use YaRN or similar techniques. The fine-tuning step is crucial — scaling alone works somewhat, but fine-tuning at the target length significantly improves quality.
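
The sketch below follows the publicly documented YaRN recipe: a per-dimension ramp that leaves high-frequency dimensions untouched, fully interpolates low-frequency ones, and blends in between, plus the attention scaling correction. Function names, the beta_fast/beta_slow defaults, and the 0.1·ln(s) + 1 factor mirror common open-source implementations; treat the exact details as assumptions rather than a reference implementation.

```python
import math
import numpy as np

def yarn_inv_freq(dim=128, base=10000.0, scale=8.0,
                  orig_ctx=4096, beta_fast=32, beta_slow=1):
    """Per-dimension RoPE frequencies adjusted YaRN-style.

    Dimensions that complete many rotations within orig_ctx (high
    frequency, local patterns) keep their original frequency; those
    completing fewer than beta_slow rotations are fully interpolated
    (divided by scale); a linear ramp blends the regimes in between.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))

    def dim_for_rotations(num_rot):
        # Index of the dimension whose wavelength equals orig_ctx / num_rot.
        return (dim * math.log(orig_ctx / (num_rot * 2 * math.pi))
                / (2 * math.log(base)))

    low = math.floor(dim_for_rotations(beta_fast))
    high = math.ceil(dim_for_rotations(beta_slow))
    ramp = np.clip((np.arange(dim // 2) - low) / max(high - low, 1), 0.0, 1.0)

    # ramp = 0 -> keep original frequency, ramp = 1 -> fully interpolated.
    return inv_freq * (1.0 - ramp) + (inv_freq / scale) * ramp

def yarn_mscale(scale=8.0):
    """Attention scaling ("temperature") correction: queries and keys are
    each multiplied by this factor, keeping attention entropy roughly
    constant after the extension."""
    return 1.0 if scale <= 1.0 else 0.1 * math.log(scale) + 1.0
```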
