Instruction Tuning: Definition & Meaning — AI Wiki

Fine-tuning a pre-trained language model on a dataset of (instruction, response) pairs to teach it to follow instructions. A base model that only predicts text becomes a model that answers questions, follows directions, and behaves like an assistant. This is the step that turns GPT into ChatGPT, or base Llama into Llama-Chat.

Why It Matters

Instruction tuning is the bridge between a raw language model (which can only complete text) and a useful assistant (which can follow instructions). Without it, even the most capable base model just generates plausible-sounding text rather than actually doing what you ask. It is arguably the most important post-training step.

Deep Dive

The process: collect thousands to millions of (instruction, ideal response) pairs covering diverse tasks — Q&A, summarization, coding, creative writing, math, conversation. Fine-tune the base model on these pairs using standard supervised learning (minimize the loss on the response tokens given the instruction). The model learns the meta-pattern: "when given an instruction, produce a helpful response."
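The phrase "minimize the loss on the response tokens given the instruction" can be made concrete with a small sketch. The pure-Python example below is illustrative only: the function names, the toy 3-token vocabulary, and the `loss_mask` convention (1 for response tokens, 0 for instruction tokens) are assumptions made for this example, not any particular library's API. It shows the one idea that matters: instruction tokens are fed to the model as context but are excluded from the loss.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood over response tokens only.

    logits_per_position[t]: model scores over the vocabulary at position t
    target_ids[t]:          the token that actually appears at position t
    loss_mask[t]:           1 for response tokens, 0 for instruction tokens
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0

# Toy sequence: 2 instruction positions followed by 2 response positions.
toy_logits = [
    [0.0, 0.0, 0.0],  # instruction position (masked out)
    [5.0, 0.0, 0.0],  # instruction position (masked out)
    [0.0, 0.0, 0.0],  # response position, uniform prediction
    [0.0, 0.0, 0.0],  # response position, uniform prediction
]
toy_targets = [1, 2, 0, 2]
toy_mask = [0, 0, 1, 1]

loss = sft_loss(toy_logits, toy_targets, toy_mask)
print(round(loss, 4))  # 1.0986, i.e. log(3) for a uniform guess over 3 tokens
```

In real training code the same masking is usually done on tensors, but the principle is unchanged: only the positions belonging to the ideal response contribute gradient.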

SFT vs. RLHF vs. DPO

Instruction tuning (Supervised Fine-Tuning / SFT) is typically the first post-training step, followed by alignment via RLHF or DPO. SFT teaches the model the format and basic helpfulness. RLHF/DPO then refines the behavior — making responses more helpful, less harmful, and better calibrated. Some approaches (like ORPO) combine SFT and preference alignment into a single step.
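To illustrate how a preference step differs from SFT, here is a minimal sketch of the DPO objective for a single preference pair. Where SFT imitates one ideal response, DPO compares a preferred ("chosen") and a dispreferred ("rejected") response, measuring each against a frozen reference model (typically the SFT checkpoint). The function name, the example log-probabilities, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative value, not a recommendation.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written as softplus(-z) for numerical stability
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: no preference learned yet.
at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(round(at_start, 4))   # 0.6931, i.e. log(2)
print(improved < at_start)  # True
```

The loss falls as the policy widens the gap between chosen and rejected responses relative to the reference, which is the "refinement" that comes after SFT has taught the basic format.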

Data Quality Over Quantity

Several studies suggest that a small set of high-quality instruction-response pairs can match or outperform a much larger set of low-quality ones. The LIMA paper (Zhou et al., 2023) showed that fine-tuning a 65B-parameter LLaMA model on just 1,000 carefully curated examples could produce surprisingly good results. The key is diversity (covering many task types) and quality (responses that are genuinely excellent, not just adequate). This is why instruction data curation has become a specialized discipline.
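As a rough illustration of what curation can look like in practice, the sketch below applies three simple filters: deduplication, a minimum response length, and a per-task cap to keep the set diverse. These heuristics, the field names (`task`, `instruction`, `response`), and the thresholds are all hypothetical choices for this example. They are not the procedure used in LIMA or any other paper, and real pipelines typically add human review or model-based quality scoring.

```python
from collections import defaultdict

def curate(examples, min_response_words=5, max_per_task=2):
    """Keep unique, sufficiently detailed examples, capped per task type."""
    seen_instructions = set()
    kept_per_task = defaultdict(int)
    curated = []
    for ex in examples:
        # Normalize case and whitespace so near-identical prompts collide.
        key = " ".join(ex["instruction"].lower().split())
        if key in seen_instructions:
            continue  # duplicate instruction
        seen_instructions.add(key)
        if len(ex["response"].split()) < min_response_words:
            continue  # crude proxy for a response that is too thin
        if kept_per_task[ex["task"]] >= max_per_task:
            continue  # stop one task type from dominating the set
        kept_per_task[ex["task"]] += 1
        curated.append(ex)
    return curated

examples = [
    {"task": "qa", "instruction": "What is SFT?",
     "response": "Supervised fine-tuning trains a base model on instruction and response pairs."},
    {"task": "qa", "instruction": "what is  SFT?",
     "response": "Supervised fine-tuning trains a base model on instruction and response pairs."},
    {"task": "qa", "instruction": "Define RLHF.",
     "response": "It is RL."},
    {"task": "coding", "instruction": "Reverse a list in Python.",
     "response": "Use slicing with a step of minus one: items[::-1]."},
    {"task": "qa", "instruction": "What is DPO?",
     "response": "Direct Preference Optimization trains directly on preference pairs."},
    {"task": "qa", "instruction": "What is ORPO?",
     "response": "ORPO folds preference alignment into the supervised fine-tuning objective."},
]

curated = curate(examples)
print([ex["instruction"] for ex in curated])
# ['What is SFT?', 'Reverse a list in Python.', 'What is DPO?']
```

The duplicate, the three-word answer, and the third "qa" example beyond the cap are all dropped, leaving a smaller but more varied set.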
