
DPO

Direct Preference Optimization
An alternative to RLHF for aligning language models with human preferences. Instead of training a separate reward model and then using reinforcement learning to optimize against it (the RLHF pipeline), DPO optimizes the language model directly on pairs of preferred and rejected responses. It is simpler, more stable, and requires less compute than RLHF while achieving comparable results.

Why it matters

DPO changed the alignment game by democratizing it. RLHF demands a complex multi-stage pipeline (collect preferences, train a reward model, run PPO) that is notoriously finicky. DPO collapses this into a single training step, making preference alignment accessible to smaller teams and open-source projects. Many recent open-weight models use DPO or its variants instead of RLHF.

Deep Dive

The key insight of DPO (Rafailov et al., 2023) is mathematical: there's a closed-form mapping between the optimal policy under a reward function and the reward function itself. This means you can skip the reward model entirely and directly adjust the language model's probabilities to prefer the chosen response over the rejected one. The loss function is elegantly simple — it increases the log-probability of preferred responses relative to rejected ones, with a reference model as anchor to prevent the policy from drifting too far.
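
In code, that anchor shows up as a log-probability ratio against a frozen reference model. A minimal PyTorch sketch of the loss, assuming the per-sequence log-probabilities for each (prompt, response) pair have already been summed over tokens (function and argument names here are illustrative, not from any particular library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), per pair
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt), per pair
    ref_chosen_logps: torch.Tensor,       # same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards: how far the policy has moved away from the
    # reference model on each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # The loss widens the margin between chosen and rejected responses;
    # the reference log-ratios act as the anchor described above.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

The beta coefficient (commonly around 0.1) controls the strength of the anchor: larger values penalize drift from the reference model more heavily.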

The Preference Data

Like RLHF, DPO needs preference data: pairs of responses where a human (or another model) has indicated which is better. The quality and diversity of these pairs matters enormously. If all your preference pairs are about formatting, the model learns to format well but doesn't improve on substance. The annotation guidelines, the diversity of prompts, and the quality of annotators are where alignment efforts actually succeed or fail — the algorithm is just the last step.
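
For a sense of what this data looks like, here is one hypothetical preference record in the pairwise format most DPO training setups consume (field names are illustrative):

```python
# One hypothetical preference record; a DPO dataset is a list of these.
# The "chosen"/"rejected" labels come from human or model annotators.
preference_pair = {
    "prompt": "Explain what a reference model does in DPO.",
    "chosen": (
        "The reference model is a frozen copy of the starting policy; "
        "the loss measures how far the trained policy drifts from it."
    ),
    "rejected": "It references things.",
}
```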

Variants and Evolution

DPO spawned a family of related techniques: IPO (Identity Preference Optimization) addresses overfitting issues, KTO (Kahneman-Tversky Optimization) works with binary feedback instead of pairwise comparisons, and ORPO (Odds Ratio Preference Optimization) combines supervised fine-tuning with preference alignment in a single step. The field is moving fast, but the core insight — you don't need RL to align models — remains foundational.
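
To make one of those contrasts concrete, here is a sketch of the IPO objective under the same setup as the DPO loss above: it swaps the logistic loss for a squared regression toward a fixed margin, which is how it curbs overfitting to the preference labels (names are again illustrative, and the hyperparameter tau is assumed):

```python
import torch

def ipo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    tau: float = 0.1,
) -> torch.Tensor:
    # Same log-ratio margin as in the DPO loss...
    margin = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )
    # ...but regressed toward a fixed target 1/(2*tau) instead of being
    # pushed through a sigmoid, which bounds the implied reward gap.
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```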
