DPO: Definition & Meaning — AI Wiki

An alternative to RLHF for aligning language models with human preferences. Instead of training a separate reward model and then using reinforcement learning to optimize against it (the RLHF pipeline), DPO optimizes the language model directly on pairs of preferred and rejected responses. It is simpler and more stable than RLHF and needs less compute, while still delivering comparable results.

Why it matters

DPO changed the alignment game by democratizing it. RLHF requires a complex multi-stage pipeline (collect preferences, train a reward model, run PPO) that is notoriously finicky. DPO collapses this into a single training step, which makes preference alignment accessible to smaller teams and open-source projects. Many recent open-weight models use DPO or one of its variants instead of RLHF.

Deep Dive

The key insight of DPO (Rafailov et al., 2023) is mathematical: for the KL-regularized reward-maximization objective used in RLHF, there is a closed-form mapping between a reward function and its optimal policy. The reward can therefore be expressed in terms of the policy itself, so you can skip the reward model entirely and directly adjust the language model's probabilities to prefer the chosen response over the rejected one. The loss function is simple: it increases the log-probability of preferred responses relative to rejected ones, measured against a frozen reference model that anchors the policy and keeps it from drifting too far.
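To make the loss concrete, here is a minimal sketch for a single preference pair. It follows the published form of the objective, loss = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w is the chosen response and y_l the rejected one. The function and argument names, and the default beta = 0.1, are illustrative assumptions rather than any particular library's API. Each input is assumed to be a sequence log-probability, that is, the sum of token log-probabilities of a response given the prompt.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how strongly the reference anchors the policy
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favours the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2), roughly 0.693.
initial_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After the policy shifts probability toward the chosen response, the loss drops.
improved_loss = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In a real training loop the same expression would be computed on batched tensors and averaged, with gradients flowing only through the policy's log-probabilities; the reference model stays frozen.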

The Preference Data

Like RLHF, DPO needs preference data: pairs of responses to the same prompt where a human (or another model) has indicated which one is better. The quality and diversity of these pairs matter enormously. If all your preference pairs are about formatting, the model learns to format well but doesn't improve on substance. The annotation guidelines, the diversity of prompts, and the quality of annotators are where alignment efforts actually succeed or fail; the algorithm is just the last step.
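As an illustration of what such data can look like, the sketch below defines one possible record layout and two simple checks: dropping pairs that carry no preference signal, and measuring how the remaining pairs are spread across topics. The `PreferencePair` class, the `category` tag, and both helper functions are hypothetical names introduced for this example, not a standard format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower
    category: str  # hypothetical tag, e.g. "formatting" or "factuality"


def is_usable(pair: PreferencePair) -> bool:
    """A pair is usable only if all fields are non-empty and the responses differ."""
    chosen, rejected = pair.chosen.strip(), pair.rejected.strip()
    return bool(pair.prompt.strip()) and bool(chosen) and bool(rejected) and chosen != rejected


def category_shares(pairs: list[PreferencePair]) -> dict[str, float]:
    """Fraction of pairs per category, as a crude diversity check."""
    counts = Counter(p.category for p in pairs)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Toy data, made up purely for illustration.
pairs = [
    PreferencePair("What is 2 + 2?", "2 + 2 = 4.", "2 + 2 = 5.", "factuality"),
    PreferencePair("List three primary colors.", "- red\n- yellow\n- blue", "red yellow blue", "formatting"),
    PreferencePair("Say hello.", "Hello!", "Hello!", "formatting"),  # identical responses: no signal
]

usable = [p for p in pairs if is_usable(p)]
shares = category_shares(usable)
```

A dataset whose shares are dominated by one category, such as formatting, is a sign that the model will mostly learn that one thing.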

Variants and Evolution

DPO spawned a family of related techniques: IPO (Identity Preference Optimization) addresses overfitting issues, KTO (Kahneman-Tversky Optimization) works with binary feedback instead of pairwise comparisons, and ORPO (Odds Ratio Preference Optimization) combines supervised fine-tuning with preference alignment in a single step. The field is moving fast, but the core insight — you don't need RL to align models — remains foundational.
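To show how a variant can differ from DPO while using the same inputs, the sketch below compares the DPO loss with a commonly cited form of the IPO loss, (h - 1/(2*tau))^2, where h is the chosen-minus-rejected log-ratio margin against the reference model. The function names and the values of beta and tau are assumptions for illustration; KTO and ORPO use different objectives and are not shown.

```python
import math


def preference_margin(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
) -> float:
    """Log-ratio margin h: how much more the policy favours chosen over rejected than the reference does."""
    return (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)


def dpo_loss_from_margin(h: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * h)), written as a numerically stable softplus of -beta * h."""
    x = beta * h
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def ipo_loss_from_margin(h: float, tau: float = 0.1) -> float:
    """Squared distance of the margin from the fixed target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


# DPO keeps rewarding a larger margin, so the loss never reaches zero.
dpo_at_target = dpo_loss_from_margin(5.0)
dpo_beyond = dpo_loss_from_margin(10.0)

# IPO is minimized at h = 1 / (2 * tau) = 5 and penalizes overshooting it.
ipo_at_target = ipo_loss_from_margin(5.0)
ipo_beyond = ipo_loss_from_margin(10.0)
```

The bounded target is what lets this form of IPO resist the overfitting that can occur when DPO pushes the margin ever higher on a small dataset.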
