
DPO

Direct Preference Optimization
An alternative to RLHF for aligning language models with human preferences. Instead of training a separate reward model and then optimizing against it with reinforcement learning (the RLHF pipeline), DPO optimizes the language model directly on pairs of preferred and rejected responses. It is simpler, more stable, and requires less compute than RLHF while achieving comparable results.

Why It Matters

DPO changed the game by democratizing alignment. RLHF requires a complex multi-stage pipeline (collecting preferences, training a reward model, running PPO) that is notoriously hard to tune. DPO collapses this into a single training step, making preference alignment accessible to small teams and open-source projects. Many recent open-weight models use DPO or one of its variants in place of RLHF.

Deep Dive

The key insight of DPO (Rafailov et al., 2023) is mathematical: there's a closed-form mapping between the optimal policy under a reward function and the reward function itself. This means you can skip the reward model entirely and directly adjust the language model's probabilities to prefer the chosen response over the rejected one. The loss function is elegantly simple — it increases the log-probability of preferred responses relative to rejected ones, with a reference model as anchor to prevent the policy from drifting too far.
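A minimal sketch of that loss in PyTorch-style code helps make it concrete. The function name, argument names, and the default beta below are illustrative assumptions, not a reference implementation; each argument is a per-example sequence log-probability (summed token log-probs) under either the trainable policy or the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO loss. Each argument is a 1-D tensor of sequence
    log-probabilities for the chosen or rejected response, computed under
    the trainable policy or the frozen reference model."""
    # Implicit "rewards": how far the policy has shifted probability mass
    # away from the reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the margin: -log sigmoid(chosen - rejected).
    # Minimizing this pushes up the chosen response relative to the rejected one.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
```

The beta coefficient plays the role of the anchor described above: larger values keep the policy closer to the reference model, smaller values let it move more aggressively toward the preferred responses.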

The Preference Data

Like RLHF, DPO needs preference data: pairs of responses where a human (or another model) has indicated which is better. The quality and diversity of these pairs matters enormously. If all your preference pairs are about formatting, the model learns to format well but doesn't improve on substance. The annotation guidelines, the diversity of prompts, and the quality of annotators are where alignment efforts actually succeed or fail — the algorithm is just the last step.
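For illustration, a single preference example might look like the record below. The prompt and responses are invented, and the field names (prompt / chosen / rejected) follow a common convention in open DPO datasets rather than a fixed standard.

```python
# One hypothetical preference record; the content and field names are
# examples of a common convention, not a required schema.
preference_pair = {
    "prompt": "Explain in one paragraph why the sky is blue.",
    "chosen": (
        "Sunlight scatters off air molecules, and shorter (blue) wavelengths "
        "scatter more strongly than longer ones, so the sky looks blue."
    ),
    "rejected": "The sky is blue because it reflects the color of the ocean.",
}
```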

Variants and Evolution

DPO spawned a family of related techniques: IPO (Identity Preference Optimization) addresses overfitting issues, KTO (Kahneman-Tversky Optimization) works with binary feedback instead of pairwise comparisons, and ORPO (Odds Ratio Preference Optimization) combines supervised fine-tuning with preference alignment in a single step. The field is moving fast, but the core insight — you don't need RL to align models — remains foundational.
