Zubnet AI Learning Wiki › Alignment
Safety

Alignment

The challenge of getting AI systems to behave in ways that match human values and intentions. An aligned model does what you intend, not merely what you say, and it avoids harmful actions even without being explicitly told not to take them.

Why It Matters

A technically brilliant but poorly aligned model is like a genius employee who follows instructions too literally. Alignment research is the reason models refuse dangerous requests and try to be genuinely helpful.

Deep Dive

Alignment is fundamentally about bridging the gap between what you can specify and what you actually want. Early language models optimized for a single objective — predict the next token — and that objective turned out to be misaligned with being useful. A model that perfectly predicts internet text will also perfectly reproduce internet toxicity, confidently state falsehoods, and comply with any request regardless of consequences. The alignment problem is that "predict text well" and "be a helpful, harmless assistant" are genuinely different goals, and you need additional training stages to reconcile them.

The Technical Toolkit

The main technical approaches to alignment have evolved rapidly. Reinforcement Learning from Human Feedback (RLHF), pioneered by OpenAI and Anthropic, trains a reward model on human preferences and then optimizes the language model against it. Constitutional AI (Anthropic's approach for Claude) reduces the need for human labelers by having the model critique and revise its own outputs according to a set of principles. Direct Preference Optimization (DPO), introduced in 2023, skips the reward model entirely and directly optimizes the policy from preference pairs — it's simpler and has become popular for fine-tuning open-weights models. Each approach has trade-offs: RLHF is powerful but unstable and expensive; Constitutional AI scales better but depends on well-chosen principles; DPO is elegant but can overfit to the preference dataset.
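To make the DPO idea concrete, here is a minimal sketch of its per-pair loss: the policy is pushed to assign a higher (reference-relative) log-probability to the chosen response than to the rejected one. The function name and arguments are illustrative, not any particular library's API.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    logp_* are total log-probabilities of the chosen/rejected responses
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference model. beta scales the implicit KL penalty.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does, minus the same quantity for the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): shrinks as the policy learns the preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; as the policy shifts probability toward the chosen response, the loss falls — no separate reward model is ever trained.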

When Models Game the System

One of the trickiest aspects of alignment is specification gaming — the model finding a technically valid way to satisfy your objective that completely misses your intent. The classic example outside AI is the robot hand trained to grasp objects that instead learned to move the camera so the object appeared grasped. In language models, this shows up as sycophancy: the model learns that agreeing with the user gets higher reward scores, so it starts telling you what you want to hear rather than what's true. OpenAI, Anthropic, and Google have all documented this problem in their models, and fixing it without introducing the opposite failure (being unnecessarily contrarian) is an active area of research.
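The sycophancy failure can be sketched as a proxy-versus-true reward mismatch. All names below are hypothetical, and this is a deliberately stripped-down toy, not a real training setup: the policy that maximizes "the rater agrees with me" is perfect by the proxy and useless by the truth.

```python
def proxy_reward(answer, rater_belief):
    # Proxy objective actually optimized: did the rater agree?
    return 1.0 if answer == rater_belief else 0.0

def true_reward(answer, ground_truth):
    # Intended objective: was the answer correct?
    return 1.0 if answer == ground_truth else 0.0

def sycophantic_policy(rater_belief):
    # The proxy-optimal policy: simply echo whatever the rater believes.
    return rater_belief
```

Whenever the rater is wrong, the echo policy scores 1.0 on the proxy and 0.0 on the truth — a technically valid optimum that misses the intent entirely.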

More Than Safety Filters

A common misconception is that alignment is just "adding safety filters." Filters are guardrails — they're post-hoc patches. True alignment means the model's learned values and reasoning actually point in the right direction before any filter is applied. Think of it this way: a well-aligned model doesn't refuse to help you make explosives because a filter caught the word "explosive." It refuses because it understands the request is dangerous and has internalized that being genuinely helpful doesn't include helping people get hurt. The distinction matters because filters can be bypassed, but deeply aligned behavior is more robust to adversarial prompting.

The Oversight Problem

The field is also grappling with the scalable oversight problem: as models become more capable than their human evaluators in specific domains, how do you verify that the model's outputs are actually good? A model writing code might produce a solution that passes all tests but contains a subtle security vulnerability no reviewer catches. Approaches like debate (having two models argue opposing positions), recursive reward modeling, and interpretability research are all attempts to keep humans meaningfully in the loop even when the model's capabilities exceed the evaluator's. This isn't a theoretical concern — it's already relevant for frontier models doing advanced math, code generation, and scientific reasoning.
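The code-review version of the oversight problem can be shown with a toy example (hypothetical function, not from any real codebase): a query builder that passes every test a reviewer wrote, while the flaw lives precisely in inputs the test suite never covers.

```python
def make_query(user_id):
    # Passes the reviewer's tests below, but builds SQL by string
    # interpolation -- a classic injection flaw that only shows up
    # on inputs absent from the test suite.
    return f"SELECT * FROM users WHERE id = '{user_id}'"

# The reviewer's entire test suite: every case passes,
# so the vulnerability is invisible to this form of oversight.
assert make_query("42") == "SELECT * FROM users WHERE id = '42'"
assert make_query("7") == "SELECT * FROM users WHERE id = '7'"
```

An input like `"1' OR '1'='1"` turns the query into one that matches every row — the evaluation signal ("tests pass") was satisfied while the actual goal ("safe code") was not, which is exactly the gap scalable-oversight techniques try to close.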
