
Reward Model

RM, Preference Model
A model trained to predict human preferences between AI responses. Given a prompt and two candidate responses, the reward model scores which one a human would prefer. In the RLHF pipeline, the reward model provides the signal used to train the language model to produce better responses; it is a learned proxy for human judgment.

Why It Matters

The reward model is the key component that makes RLHF work. You cannot have a human score every response during training (too slow, too expensive), so you train a model to approximate human preferences and use it as the training signal. The quality of the reward model directly determines the quality of alignment; a bad reward model produces a model optimized for the wrong objective.

Deep Dive

Training a reward model: collect pairs of responses to the same prompt, have humans rank them (response A is better than response B), then train a model to predict these rankings. The reward model outputs a scalar score for any (prompt, response) pair. During RL training, the language model generates responses, the reward model scores them, and the language model is updated to produce higher-scoring responses.
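
A minimal sketch of that training step in PyTorch, assuming a Bradley-Terry style pairwise loss. The toy encoder, tensor shapes, and variable names below are illustrative placeholders, not a specific library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: an encoder plus a scalar value head.
    A real implementation would use a pretrained transformer as the encoder."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, hidden)  # stand-in encoder
        self.value_head = nn.Linear(hidden, 1)               # maps to a single score

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of prompt + response tokens -> (batch,) scores
        return self.value_head(self.encoder(token_ids)).squeeze(-1)

def pairwise_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss: push the preferred response's score above the other's."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# One training step on a toy batch of preference pairs.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

chosen_ids = torch.randint(0, 1000, (8, 32))    # prompt + human-preferred response
rejected_ids = torch.randint(0, 1000, (8, 32))  # prompt + dispreferred response

loss = pairwise_loss(model(chosen_ids), model(rejected_ids))
loss.backward()
optimizer.step()
```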

Reward Hacking

A dangerous failure mode: the language model finds ways to get high reward scores without actually being helpful. If the reward model has learned to prefer longer responses (because humans often preferred more detailed answers), the language model might pad responses with unnecessary content. This is called "reward hacking" or "reward gaming." Mitigations include KL divergence penalties (preventing the model from drifting too far from the base model), ensembles of reward models, and regular recalibration against human judgments.
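
The KL penalty can be sketched as follows, assuming per-token log-probabilities from the policy and the frozen base model are already available; the function name and the `kl_coef` value are illustrative, not a specific framework's API:

```python
import torch

def penalized_reward(rm_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Combine the reward model's score with a KL penalty toward the base model.

    rm_score:        (batch,) scalar score from the reward model per response
    policy_logprobs: (batch, seq_len) log-probs of the generated tokens under the policy
    ref_logprobs:    (batch, seq_len) log-probs of the same tokens under the frozen base model
    """
    # Per-sequence KL estimate: sum over tokens of log(pi_policy / pi_ref).
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Drifting far from the base model reduces the effective reward.
    return rm_score - kl_coef * kl
```

In a PPO-style setup, a quantity like `penalized_reward` would stand in for the raw reward-model score, so that chasing high scores at the cost of large drift from the base model stops paying off.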

DPO Bypasses the RM

DPO (Direct Preference Optimization) eliminates the separate reward model entirely, optimizing the language model directly on preference pairs. This avoids reward hacking but loses the ability to score arbitrary responses. Some labs use both: a reward model for evaluation and ranking, plus DPO for training. The optimal approach depends on scale, data quality, and how much you need to evaluate responses outside of training.
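
For comparison, a sketch of the DPO objective under similar assumptions: each input is a per-response summed log-probability computed elsewhere, and `beta` is the usual temperature-like coefficient:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    chosen / rejected response under the policy or the frozen reference model.
    """
    # Implicit rewards: how much more likely each response is under the policy
    # than under the reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that no separate reward network appears here; the frozen reference model plays the role the reward model plays in the RLHF pipeline above.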

Related Concepts

RLHF · DPO · RLAIF