Training

DPO

Direct Preference Optimization
An alternative to RLHF for aligning language models with human preferences. Instead of training a separate reward model and then optimizing against it with reinforcement learning (the RLHF pipeline), DPO optimizes the language model directly on pairs of preferred and rejected responses. It is simpler, more stable, and cheaper to run than RLHF while reaching comparable results.

Why It Matters

DPO changed the game by democratizing alignment. RLHF requires a complex multi-stage pipeline (collect preferences, train a reward model, run PPO) that is notoriously hard to tune. DPO collapses all of this into a single training step, making preference alignment accessible to small teams and open-source projects. Many recent open-weight models use DPO or one of its variants instead of RLHF.

Deep Dive

The key insight of DPO (Rafailov et al., 2023) is mathematical: there's a closed-form mapping between the optimal policy under a reward function and the reward function itself. This means you can skip the reward model entirely and directly adjust the language model's probabilities to prefer the chosen response over the rejected one. The loss function is elegantly simple — it increases the log-probability of preferred responses relative to rejected ones, with a reference model as anchor to prevent the policy from drifting too far.
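A minimal sketch of the loss described above, assuming a PyTorch setup; the function and tensor names are illustrative rather than taken from the paper or any particular library. Each input is the summed log-probability of a whole response under either the policy being trained or the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs (Rafailov et al., 2023)."""
    # How much more (or less) likely each response became under the policy
    # compared to the frozen reference model, scaled by beta.
    chosen_logratio = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_logratio = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the margin: pushes the chosen response's log-ratio
    # above the rejected one's, raising P(chosen) relative to P(rejected)
    # while the reference terms keep the policy from drifting too far.
    return -F.logsigmoid(chosen_logratio - rejected_logratio).mean()
```

Here beta plays the role of the KL-penalty strength in RLHF: a larger value anchors the policy more tightly to the reference model, a smaller one lets it move further toward the preferences.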

The Preference Data

Like RLHF, DPO needs preference data: pairs of responses where a human (or another model) has indicated which is better. The quality and diversity of these pairs matters enormously. If all your preference pairs are about formatting, the model learns to format well but doesn't improve on substance. The annotation guidelines, the diversity of prompts, and the quality of annotators are where alignment efforts actually succeed or fail — the algorithm is just the last step.
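For illustration, a single preference record might look like the sketch below; the prompt/chosen/rejected field names follow a common convention among open-source DPO trainers and are not mandated by the method itself.

```python
# One preference pair: the same prompt with a human-preferred response
# and a rejected one. A DPO dataset is simply a large list of these.
preference_pair = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model fits noise in its training data "
              "so closely that it performs worse on new, unseen examples.",
    "rejected": "Overfitting means the model is too good at its job.",
}
```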

Variants and Evolution

DPO spawned a family of related techniques: IPO (Identity Preference Optimization) addresses overfitting issues, KTO (Kahneman-Tversky Optimization) works with binary feedback instead of pairwise comparisons, and ORPO (Odds Ratio Preference Optimization) combines supervised fine-tuning with preference alignment in a single step. The field is moving fast, but the core insight — you don't need RL to align models — remains foundational.
