Training

Instruction Tuning

Instruction Fine-Tuning, IFT, SFT
Fine-tuning a pretrained language model on a dataset of (instruction, response) pairs to teach it to follow instructions. A base model that can only predict text becomes a model that answers questions, follows directions, and behaves like an assistant. This is the step that turns GPT into ChatGPT, or base Llama into Llama-Chat.

Why It Matters

Instruction tuning is the bridge between a raw language model (which can only complete text) and a useful assistant (which can follow instructions). Without it, even the strongest base model just generates plausible-sounding text rather than actually doing what you ask. It is arguably the most important post-training step.

Deep Dive

The process: collect thousands to millions of (instruction, ideal response) pairs covering diverse tasks — Q&A, summarization, coding, creative writing, math, conversation. Fine-tune the base model on these pairs using standard supervised learning (minimize the loss on the response tokens given the instruction). The model learns the meta-pattern: "when given an instruction, produce a helpful response."
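To make the "loss on the response tokens only" point concrete, here is a minimal sketch using a Hugging Face causal LM. The prompt template and example text are illustrative, not a standard; the key idea is that instruction tokens get label -100 so cross-entropy is computed only on the response.

```python
# Minimal SFT loss-masking sketch (assumes torch + transformers installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def build_example(instruction: str, response: str):
    # Hypothetical template; real projects typically use a chat template.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False).input_ids
    input_ids = torch.tensor([prompt_ids + response_ids])
    # -100 masks the instruction tokens out of the loss.
    labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])
    return input_ids, labels

input_ids, labels = build_example(
    "Summarize: The cat sat on the mat.", "A cat sat on a mat."
)
loss = model(input_ids=input_ids, labels=labels).loss  # response tokens only
loss.backward()  # an optimizer step would follow in a real training loop
```

In practice this runs over batches of such examples with padding and an optimizer, but the label masking is the part that distinguishes instruction tuning from plain next-token pretraining.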

SFT vs. RLHF vs. DPO

Instruction tuning (Supervised Fine-Tuning / SFT) is typically the first post-training step, followed by alignment via RLHF or DPO. SFT teaches the model the format and basic helpfulness. RLHF/DPO then refines the behavior — making responses more helpful, less harmful, and better calibrated. Some approaches (like ORPO) combine SFT and preference alignment into a single step.
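For contrast with SFT's plain supervised loss, here is a sketch of the DPO objective that typically follows it. It assumes you already have summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the function and argument names are illustrative, not from a specific library.

```python
# DPO loss sketch: prefer the chosen response over the rejected one,
# measured relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of policy vs. reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the chosen response's ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```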

Data Quality Over Quantity

Research consistently shows that a small set of high-quality instruction-response pairs outperforms a large set of low-quality ones. The LIMA paper (Zhou et al., 2023) showed that fine-tuning with just 1,000 carefully curated examples could produce surprisingly good results. The key is diversity (covering many task types) and quality (responses that are genuinely excellent, not just adequate). This is why instruction data curation has become a specialized discipline.

Related Concepts

Instruction Following · Jailbreak