Training

RLHF

Also known as: Reinforcement Learning from Human Feedback
A training technique in which human evaluators rank model outputs by quality, and that feedback is used to train a reward model that steers the AI toward better responses. This is how a raw pretrained model (which only predicts the next word) is turned into a helpful, harmless assistant.

Why It Matters

RLHF is the secret sauce that makes ChatGPT feel different from GPT-3. The base model already "knows" everything, but RLHF teaches it to present that knowledge in ways humans actually find useful. Safety behaviors are also reinforced through it.

Deep Dive

RLHF is a multi-stage process, and understanding each stage is essential to understanding why it works and where it breaks down.

1. Start with a model that has already been supervised fine-tuned (SFT) on instruction-response pairs, so it can at least format responses correctly.
2. Collect comparison data: human annotators are shown two or more model responses to the same prompt and asked to rank them by quality. This comparison data is used to train a separate reward model, a neural network that takes a prompt-response pair and outputs a scalar score predicting how much a human would prefer that response.
3. Use the reward model as a signal to further train the main model via a reinforcement learning algorithm, typically Proximal Policy Optimization (PPO). The model generates responses, the reward model scores them, and the model's parameters are updated to increase the expected reward.

A critical component is the KL divergence penalty, which prevents the model from drifting too far from its SFT starting point. Without it, the model would quickly learn to exploit quirks in the reward model rather than actually producing better responses.
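Two of the pieces above can be sketched in a few lines. This is illustrative code, not any particular library's API; the function and variable names are ours, and the numbers are made up for the example.

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss for training the reward model on one comparison.

    Minimizing -log(sigmoid(chosen - rejected)) pushes the score of the
    response annotators preferred above the score of the one they rejected.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def shaped_reward(reward_model_score, policy_logprobs, sft_logprobs, beta=0.1):
    """KL-penalized reward used in the RL stage.

    policy_logprobs / sft_logprobs: per-token log-probabilities that the
    current policy and the frozen SFT model assign to the sampled response.
    beta controls how strongly the policy is tethered to the SFT model.
    """
    # Per-token estimate of KL(policy || SFT) along the sampled sequence.
    kl = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return reward_model_score - beta * kl

# If the policy drifts toward tokens the SFT model finds unlikely, the KL
# term eats into the raw reward model score:
shaped_reward(2.0, policy_logprobs=[-0.5, -0.2], sft_logprobs=[-1.5, -1.0])
```

The key design choice is that the KL term is subtracted from the reward itself, so the RL algorithm sees "be good according to the reward model, but stay recognizable as the SFT model" as a single objective.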

The Reward Model Problem

The reward model is both the linchpin and the weak link of the entire process. It must learn to predict human preferences from a limited set of comparisons, and then generalize those preferences to novel prompts and responses. In practice, reward models can develop blind spots: they might learn to prefer longer responses (because annotators often equate length with thoroughness), responses that sound confident regardless of accuracy, or responses that contain hedging language (because annotators favor cautious answers on ambiguous questions). These reward model quirks get amplified during the RL phase, a phenomenon called reward hacking or reward model overoptimization. You can literally watch it happen: as you train longer against the reward model, the reward score keeps climbing, but actual human preference for the outputs peaks and then declines. This is why RLHF practitioners cap the number of RL steps and regularly evaluate with fresh human judgments rather than trusting the reward model's scores.
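The overoptimization pattern described above can be made concrete with a toy example. The numbers here are synthetic, chosen only to show the shape of the curves: the proxy reward rises monotonically while fresh human evaluations peak and then decline.

```python
rl_steps      = [0, 1000, 2000, 3000, 4000, 5000]
proxy_reward  = [0.0, 0.8, 1.4, 1.9, 2.3, 2.6]        # reward model: keeps climbing
human_winrate = [0.50, 0.58, 0.63, 0.61, 0.55, 0.48]  # fresh human evals: peak, then decline

def best_checkpoint(steps, winrates):
    """Pick the checkpoint by fresh human judgments, not the proxy reward."""
    i = max(range(len(winrates)), key=winrates.__getitem__)
    return steps[i]

best_checkpoint(rl_steps, human_winrate)
```

Trusting `proxy_reward` would select the final checkpoint; trusting the human win rate selects a much earlier one, which is exactly why practitioners cap RL steps and evaluate with fresh judgments.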

The Alternatives

The practical challenges of RLHF are significant enough that the field has developed several alternatives. Direct Preference Optimization (DPO), introduced in 2023, eliminates the separate reward model and RL phase entirely. Instead, it directly optimizes the language model on the comparison data using a clever reformulation of the RLHF objective as a classification loss. DPO is simpler to implement, more stable to train, and requires less compute. Many open-source models now use DPO or its variants (IPO, KTO, ORPO) instead of PPO-based RLHF. Other approaches like RLAIF (RL from AI Feedback) replace human annotators with another AI model — Anthropic's Constitutional AI framework uses this approach, where the model critiques and revises its own outputs according to a set of principles. These alternatives each have trade-offs: DPO is simpler but may be less expressive for complex preference structures, while RLAIF scales better but inherits the biases of whatever AI is providing the feedback.
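The DPO reformulation mentioned above can be written out by hand for a single comparison. This follows the objective from the 2023 DPO paper, but the variable names are ours: inputs are sequence log-probabilities under the policy being trained and under a frozen reference (SFT) model, for the preferred (`w`) and dispreferred (`l`) responses.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    Each response's implicit reward is beta * log(pi / pi_ref); the loss is
    a classification loss, -log sigmoid(margin), on the reward difference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Minimizing this pushes the policy to favor the chosen response more
    # strongly than the reference model already does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note there is no reward model and no sampling loop: the loss is computed directly from log-probabilities on the static comparison data, which is where DPO's simplicity and stability come from.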

The Human Bottleneck

The human annotation side of RLHF is one of its most underappreciated complexities. Annotator quality, consistency, and demographic composition directly shape what the model learns. If your annotators are primarily English-speaking college graduates, the model learns their preferences, which may not generalize to other populations. Inter-annotator agreement on what constitutes a "better" response is often surprisingly low for open-ended questions, which means the reward model is learning from noisy labels. Some labs address this with detailed rubrics, annotator calibration sessions, and majority voting across multiple annotators per comparison. Others use synthetic data pipelines where a stronger model generates the comparisons. The field is still figuring out the best practices here, and the annotation pipeline is often the bottleneck — not because it is technically hard, but because defining "good" is genuinely philosophically difficult when you are trying to specify it precisely enough for a training signal.
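Two of the practices mentioned above, majority voting per comparison and measuring inter-annotator agreement, are simple to sketch. This is a toy illustration with our own function names; labels are "A" or "B" for which response an annotator preferred.

```python
from collections import Counter

def majority_vote(labels):
    """Resolve one comparison by majority; ties return None (discard or re-annotate)."""
    counts = Counter(labels).most_common(2)
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def agreement_rate(annotations):
    """Fraction of pairwise annotator judgments that agree, pooled over items."""
    agree = total = 0
    for labels in annotations:
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                total += 1
                agree += labels[i] == labels[j]
    return agree / total

majority_vote(["A", "B", "A"])
agreement_rate([["A", "A", "B"], ["B", "B", "B"]])
```

A raw agreement rate like this overstates consensus because two annotators choosing between two options agree half the time by chance, which is why chance-corrected statistics (e.g. Cohen's kappa) are often reported instead.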

Related Concepts

RLAIF · DPO · PPO · SFT