Training

Instruction Tuning

Instruction Fine-Tuning, IFT, SFT
Fine-tuning a pretrained language model on a dataset of (instruction, response) pairs so that it learns to follow instructions. A base model that can only predict text becomes one that answers questions, follows directions, and behaves like an assistant. This is the step that turns GPT into ChatGPT, or base Llama into Llama-Chat.

Why It Matters

Instruction tuning is the bridge between a raw language model (which can only complete text) and a useful assistant (which can follow instructions). Without it, even the strongest base model just generates plausible-sounding text instead of actually doing what you ask. It is arguably the most important post-training step.

Deep Dive

The process: collect thousands to millions of (instruction, ideal response) pairs covering diverse tasks — Q&A, summarization, coding, creative writing, math, conversation. Fine-tune the base model on these pairs using standard supervised learning (minimize the loss on the response tokens given the instruction). The model learns the meta-pattern: "when given an instruction, produce a helpful response."
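As a concrete illustration, here is a minimal sketch of that objective in PyTorch with a Hugging Face causal LM: instruction tokens are masked out of the labels with the ignore index -100, so cross-entropy is computed only on the response tokens. The model choice and helper name are placeholders, not part of any particular recipe.

```python
# Minimal SFT objective: supervise only the response tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a real base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sft_loss(instruction: str, response: str) -> torch.Tensor:
    # Tokenize separately so we know where the response starts.
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    # Labels: -100 on the instruction span tells the loss to ignore it;
    # only the response positions contribute to the cross-entropy.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    # The model shifts labels internally and averages the loss over
    # the non-masked (response) positions.
    return model(input_ids=input_ids, labels=labels).loss

loss = sft_loss("Summarize: The cat sat on the mat.", "A cat sat on a mat.")
loss.backward()  # one supervised gradient step
```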

SFT vs. RLHF vs. DPO

Instruction tuning (Supervised Fine-Tuning / SFT) is typically the first post-training step, followed by alignment via RLHF or DPO. SFT teaches the model the format and basic helpfulness. RLHF/DPO then refines the behavior — making responses more helpful, less harmful, and better calibrated. Some approaches (like ORPO) combine SFT and preference alignment into a single step.
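To make the contrast with SFT concrete, below is a small sketch of the standard DPO preference loss: the policy is rewarded for putting a larger log-probability margin on the chosen response than a frozen reference model (typically the SFT checkpoint) does. The variable names, toy values, and beta setting are illustrative assumptions, not taken from this article.

```python
# Sketch of the DPO preference loss, given summed log-probabilities of a
# chosen and a rejected response under the policy and a frozen reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are log-probability ratios against the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy values: the policy already slightly prefers the chosen response.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```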

Data Quality Over Quantity

Research consistently shows that a small set of high-quality instruction-response pairs outperforms a large set of low-quality ones. The LIMA paper (Zhou et al., 2023) showed that fine-tuning with just 1,000 carefully curated examples could produce surprisingly good results. The key is diversity (covering many task types) and quality (responses that are genuinely excellent, not just adequate). This is why instruction data curation has become a specialized discipline.
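For illustration only (these heuristics are not from the LIMA paper), a tiny curation sketch in the same spirit: deduplicate near-identical instructions and drop thin responses, so the surviving set is smaller but more varied and higher quality.

```python
# Illustrative quality/diversity filters for an instruction dataset.
def curate(pairs, min_response_words=30):
    seen = set()
    kept = []
    for instruction, response in pairs:
        key = " ".join(instruction.lower().split())[:200]  # crude dedup key
        if key in seen:
            continue  # skip near-duplicate instructions
        if len(response.split()) < min_response_words:
            continue  # drop adequate-but-thin answers
        seen.add(key)
        kept.append((instruction, response))
    return kept

curated = curate([("Explain overfitting.", "Overfitting is when a model " * 10)])
```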
