Training

DPO

Direct Preference Optimization
An alternative to RLHF for aligning language models with human preferences. Instead of training a separate reward model and then using reinforcement learning to optimize against it (the RLHF pipeline), DPO optimizes the language model directly on pairs of preferred and rejected responses. It is simpler to implement, generally more stable to train, and needs less compute than PPO-based RLHF, while often achieving comparable results.

Why it matters

DPO lowered the barrier to preference alignment. RLHF requires a multi-stage pipeline (collect preferences, train a reward model, run PPO) that is notoriously sensitive to hyperparameters and implementation details. DPO still needs the preference data, but it collapses the reward-model and RL stages into a single supervised-style training step, making preference alignment accessible to smaller teams and open-source projects. Many recent open-weight models use DPO or its variants instead of RLHF.

Deep Dive

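The key insight of DPO (Rafailov et al., 2023) is mathematical: for the KL-constrained reward-maximization objective used in RLHF, there is a closed-form mapping between a reward function and its optimal policy. The reward can therefore be rewritten in terms of the policy itself, as a scaled log-ratio between the policy's probability and a reference model's probability for the same response. This means you can skip the reward model entirely and directly adjust the language model's probabilities to prefer the chosen response over the rejected one. The resulting loss is a simple binary classification loss: it increases the log-probability of preferred responses relative to rejected ones, with both measured against a frozen reference model (usually the supervised fine-tuned checkpoint) that anchors the policy and keeps it from drifting too far. A coefficient, β, controls how strongly the policy is held to that anchor.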
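As a rough illustration, the per-pair loss can be sketched in plain Python. This is a minimal sketch, not a training implementation: it assumes the four inputs are summed log-probabilities of each full response under the policy and the frozen reference model, and the function names and the default beta=0.1 are illustrative choices rather than values from the text.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is assumed to be the summed log-probability of a
    whole response under the policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss falls only when the policy widens the gap between chosen and rejected responses relative to the reference, which is how the reference model acts as an anchor. In a real training loop the same arithmetic would run on batched tensors with gradients flowing through the policy's log-probabilities only.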

The Preference Data

Like RLHF, DPO needs preference data: pairs of responses to the same prompt where a human (or another model) has indicated which is better. The quality and diversity of these pairs matter enormously. If all your preference pairs are about formatting, the model learns to format well but doesn't improve on substance. The annotation guidelines, the diversity of prompts, and the quality of annotators are where alignment efforts actually succeed or fail; the algorithm is just the last step.
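One way to picture a preference dataset is as a list of prompt, chosen, and rejected records. The sketch below is illustrative only: the PreferencePair class, its field names, and the filter_pairs helper are assumptions for this example, and the filter covers only the most basic hygiene checks, not the annotation-quality issues described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record (hypothetical schema)."""
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower


def filter_pairs(pairs):
    """Drop pairs with no preference signal and exact duplicates."""
    seen = set()
    kept = []
    for pair in pairs:
        # Identical responses give the loss nothing to separate.
        if pair.chosen.strip() == pair.rejected.strip():
            continue
        # Exact duplicates over-weight a single judgment.
        if pair in seen:
            continue
        seen.add(pair)
        kept.append(pair)
    return kept


raw = [
    PreferencePair("Explain DPO.", "A clear, correct answer.", "A vague answer."),
    PreferencePair("Explain DPO.", "A clear, correct answer.", "A vague answer."),
    PreferencePair("Say hi.", "Hello!", "Hello!"),
]
clean = filter_pairs(raw)
print(len(clean))
```

Checks like these catch mechanical problems only. Whether the pairs cover substance as well as formatting still has to be judged from the prompts and the annotation guidelines.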

Variants and Evolution

DPO spawned a family of related techniques. IPO (Identity Preference Optimization) addresses DPO's tendency to overfit the preference data. KTO (Kahneman-Tversky Optimization) works with binary feedback (each response labeled good or bad on its own) instead of pairwise comparisons. ORPO (Odds Ratio Preference Optimization) combines supervised fine-tuning with preference alignment in a single step and drops the reference model. The field is moving fast, but the core insight remains foundational: you don't need RL to align models.
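To make the overfitting point concrete, here is a minimal sketch of the IPO objective for a single pair, as commonly stated: a squared loss that pulls the log-ratio margin toward a fixed target of 1/(2τ), where DPO's logistic loss keeps rewarding ever larger margins. The function name, the summed log-probability inputs, and the default tau=0.1 are assumptions for illustration.

```python
def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, tau=0.1):
    """IPO loss for one preference pair (hypothetical helper).

    Inputs are assumed to be summed response log-probabilities under
    the policy and the frozen reference model.
    """
    # Margin between the chosen and rejected log-ratios.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # Squared distance to a finite target margin of 1 / (2 * tau).
    target = 1.0 / (2.0 * tau)
    return (margin - target) ** 2


at_reference = ipo_loss(-10.0, -12.0, -10.0, -12.0)  # margin 0
at_target = ipo_loss(-7.0, -14.0, -10.0, -12.0)      # margin 5
overshoot = ipo_loss(-5.0, -17.0, -10.0, -12.0)      # margin 10
print(at_reference, at_target, overshoot)
```

Because the loss rises again once the margin passes the target, the policy has no incentive to push preferred and rejected probabilities arbitrarily far apart, which is the regularizing effect IPO aims for.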
