Training

Reward Model

RM, Preference Model
A model trained to predict which AI responses humans prefer. Given a prompt and two candidate responses, the reward model scores which one a human would choose. In the RLHF pipeline, the reward model provides the signal used to train the language model to produce better responses; it is a learned proxy for human judgment.
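
A minimal sketch of this setup, assuming a PyTorch / Hugging Face-style backbone (the class name RewardModel, the last_hidden_state access, and the last-token pooling are illustrative assumptions, not from the original): the prompt and response are encoded together, and a linear head turns the final token's hidden state into a single scalar score.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                     # causal LM that returns hidden states (assumed)
        self.score_head = nn.Linear(hidden_size, 1)  # scalar reward head

    def forward(self, input_ids, attention_mask):
        # Encode the concatenated prompt + response.
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Pool the hidden state of the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(pooled).squeeze(-1)   # one scalar per (prompt, response)
```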

Why It Matters

The reward model is the key component that makes RLHF work. You cannot have a human score every response during training (too slow, too expensive), so you train a model to approximate human preferences and use it as the training signal. The quality of the reward model directly determines the quality of alignment; a poor reward model produces a model optimized for the wrong objective.

Deep Dive

Training a reward model: collect pairs of responses to the same prompt, have humans rank them (response A is better than response B), then train a model to predict these rankings. The reward model outputs a scalar score for any (prompt, response) pair. During RL training, the language model generates responses, the reward model scores them, and the language model is updated to produce higher-scoring responses.
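
The ranking objective described above is usually a pairwise (Bradley-Terry style) loss: maximize the probability that the human-preferred response scores higher than the rejected one. A sketch, assuming a reward_model callable like the one above and hypothetical tokenized inputs:

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    r_chosen = reward_model(chosen_ids, chosen_mask)        # score of the human-preferred response
    r_rejected = reward_model(rejected_ids, rejected_mask)  # score of the rejected response
    # Maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```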

Reward Hacking

A dangerous failure mode: the language model finds ways to get high reward scores without actually being helpful. If the reward model has learned to prefer longer responses (because humans often preferred more detailed answers), the language model might pad responses with unnecessary content. This is called "reward hacking" or "reward gaming." Mitigations include KL divergence penalties (preventing the model from drifting too far from the base model), ensembles of reward models, and regular recalibration against human judgments.
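
One way to picture the KL-penalty mitigation: the reward actually fed to the RL step is the reward-model score minus a penalty proportional to how far the policy's token probabilities have drifted from the frozen base model. A rough sketch, with beta and the log-probability tensors as illustrative assumptions:

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    # Approximate per-sequence KL: sum of per-token log-prob differences on the sampled response.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # A high RM score is worth less if the policy has drifted far from the base model.
    return rm_score - beta * kl
```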

DPO Bypasses the RM

DPO (Direct Preference Optimization) eliminates the separate reward model entirely, optimizing the language model directly on preference pairs. This avoids reward hacking but loses the ability to score arbitrary responses. Some labs use both: a reward model for evaluation and ranking, plus DPO for training. The optimal approach depends on scale, data quality, and how much you need to evaluate responses outside of training.
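
For contrast, a sketch of the DPO objective (tensor names and the beta temperature are assumptions for illustration): the preference pair is optimized directly, with the frozen reference model standing in for the explicit reward model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-prob of each response under the policy, measured relative to the frozen reference.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response by a margin (scaled by beta).
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Note that the reward here exists only implicitly, as a log-probability ratio against the reference model, which is why DPO cannot score arbitrary new responses the way a standalone reward model can.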

Related Concepts
