RLHF is a multi-stage process, and understanding each stage is essential to understanding why it works and where it breaks down. First, you start with a model that has already been supervised fine-tuned (SFT) on instruction-response pairs, so it can at least format responses correctly. Second, you collect comparison data: human annotators are shown two or more model responses to the same prompt and asked to rank them by quality. This comparison data is used to train a separate reward model — a neural network that takes a prompt-response pair and outputs a scalar score predicting how much a human would prefer that response. Third, you use the reward model as a signal to further train the main model via a reinforcement learning algorithm, typically Proximal Policy Optimization (PPO). The model generates responses, the reward model scores them, and the model's parameters are updated to increase the expected reward. A critical component is the KL divergence penalty, which prevents the model from drifting too far from its SFT starting point — without it, the model would quickly learn to exploit quirks in the reward model rather than actually producing better responses.
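The two learning signals described above can be sketched in a few lines. This is a deliberately simplified, hypothetical rendering (the function names are illustrative, and the per-token KL term is approximated by a log-probability difference rather than a full divergence), not a production implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(score_chosen, score_rejected):
    # Pairwise (Bradley-Terry style) loss for reward model training:
    # push the chosen response's scalar score above the rejected one's.
    return -math.log(sigmoid(score_chosen - score_rejected))

def kl_shaped_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    # Reward passed to the RL step: the reward model's score minus a
    # KL penalty that discourages drifting from the SFT reference policy.
    return rm_score - beta * (logprob_policy - logprob_ref)
```

The `beta` coefficient controls the trade-off: larger values keep the policy closer to its SFT starting point at the cost of slower reward improvement.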
The reward model is both the linchpin and the weak link of the entire process. It must learn to predict human preferences from a limited set of comparisons, and then generalize those preferences to novel prompts and responses. In practice, reward models can develop blind spots: they might learn to prefer longer responses (because annotators often equate length with thoroughness), responses that sound confident regardless of accuracy, or responses that contain hedging language (because annotators favor cautious answers on ambiguous questions). These reward model quirks get amplified during the RL phase, a phenomenon called reward hacking or reward model overoptimization. You can literally watch it happen: as you train longer against the reward model, the reward score keeps climbing, but actual human preference for the outputs peaks and then declines. This is why RLHF practitioners cap the number of RL steps and regularly evaluate with fresh human judgments rather than trusting the reward model's scores.
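One cheap diagnostic for the length blind spot mentioned above is to check how strongly reward scores correlate with response length. A sketch, assuming you already have scores and token counts for a batch of responses (the function name and interface are made up for illustration):

```python
def length_bias(rewards, lengths):
    # Pearson correlation between reward model scores and response lengths.
    # A value near 1.0 suggests the reward model is rewarding verbosity
    # rather than quality; near 0.0 suggests no simple length preference.
    n = len(rewards)
    mean_r = sum(rewards) / n
    mean_l = sum(lengths) / n
    cov = sum((r - mean_r) * (l - mean_l) for r, l in zip(rewards, lengths))
    std_r = sum((r - mean_r) ** 2 for r in rewards) ** 0.5
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    return cov / (std_r * std_l)
```

A diagnostic like this catches only the simplest failure mode; the fresh-human-judgment evaluations the paragraph describes remain the real check against overoptimization.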
The practical challenges of RLHF are significant enough that the field has developed several alternatives. Direct Preference Optimization (DPO), introduced in 2023, eliminates the separate reward model and RL phase entirely. Instead, it directly optimizes the language model on the comparison data using a clever reformulation of the RLHF objective as a classification loss. DPO is simpler to implement, more stable to train, and requires less compute. Many open-source models now use DPO or its variants (IPO, KTO, ORPO) instead of PPO-based RLHF. Other approaches like RLAIF (RL from AI Feedback) replace human annotators with another AI model — Anthropic's Constitutional AI framework uses this approach, where the model critiques and revises its own outputs according to a set of principles. These alternatives each have trade-offs: DPO is simpler but may be less expressive for complex preference structures, while RLAIF scales better but inherits the biases of whatever AI is providing the feedback.
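The DPO reformulation is compact enough to show directly. For a single preference pair, the loss is a logistic loss on the difference of implicit rewards, where each implicit reward is `beta` times the policy's log-probability advantage over the frozen reference (SFT) model. A minimal sketch over scalar sequence log-probabilities, not a full training loop:

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_lp_chosen, ref_lp_rejected, beta=0.1):
    # Implicit reward for each response: beta * (policy logprob - reference logprob).
    # The loss is a binary classification loss on the chosen-vs-rejected margin,
    # so no separate reward model and no RL rollout phase are needed.
    margin = beta * ((lp_chosen - ref_lp_chosen) - (lp_rejected - ref_lp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; gradient descent then increases the chosen response's relative log-probability, which is the same effect the PPO phase achieves indirectly through the reward model.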
The human annotation side of RLHF is one of its most underappreciated complexities. Annotator quality, consistency, and demographic composition directly shape what the model learns. If your annotators are primarily English-speaking college graduates, the model learns their preferences, which may not generalize to other populations. Inter-annotator agreement on what constitutes a "better" response is often surprisingly low for open-ended questions, which means the reward model is learning from noisy labels. Some labs address this with detailed rubrics, annotator calibration sessions, and majority voting across multiple annotators per comparison. Others use synthetic data pipelines where a stronger model generates the comparisons. The field is still figuring out the best practices here, and the annotation pipeline is often the bottleneck — not because it is technically hard, but because defining "good" is genuinely philosophically difficult when you are trying to specify it precisely enough for a training signal.
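The aggregation steps mentioned above (majority voting, measuring inter-annotator agreement) can be sketched simply. The data layout here is an assumption for illustration: each comparison is a list of per-annotator choices like `["A", "A", "B"]`:

```python
from collections import Counter

def majority_vote(labels):
    # Collapse several annotators' choices for one comparison into a
    # single training label by taking the most common choice.
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(label_sets):
    # Fraction of comparisons on which every annotator agreed. A low
    # value signals that the reward model will be trained on noisy labels.
    unanimous = sum(1 for labels in label_sets if len(set(labels)) == 1)
    return unanimous / len(label_sets)
```

Exact-agreement rate is the crudest such measure; chance-corrected statistics such as Fleiss' kappa are commonly used when annotator counts vary, but the simple version is enough to spot a noisy annotation pipeline.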