Safety

Constitutional AI

CAI
An alignment technique developed by Anthropic that trains a model to follow a set of principles (a "constitution") rather than relying on human feedback for every individual decision. The model critiques and revises its own outputs against these principles, and is then trained on the revised outputs. This reduces the need for human labelers and makes the alignment criteria explicit and auditable.

Why It Matters

Constitutional AI addresses two problems with RLHF: it is expensive (every training example requires human labeling) and opaque (the criteria are implicit in the labelers' judgments). By making the principles explicit, CAI makes alignment more transparent, more scalable, and more consistent. It is a core part of how Claude is trained.

Deep Dive

The CAI process has two phases. First, supervised learning: the model generates responses, then a separate instance critiques those responses against the constitutional principles ("Does this response help with harmful activities?"), and revises them. The model is fine-tuned on the revised responses. Second, RL from AI feedback (RLAIF): instead of human preference labels, an AI model compares response pairs against the constitution and provides the preference signal for RL training.
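A minimal sketch of both phases follows. The `complete` function is a hypothetical stand-in for any text-completion API, and the prompt wording, function names, and two-principle constitution are illustrative assumptions, not Anthropic's actual templates or pipeline.

```python
# Sketch of the CAI data-generation loop. `complete` is a hypothetical
# placeholder for a model call; real implementations batch this over a
# large prompt dataset.

CONSTITUTION = [
    "Choose the response that is most helpful while being honest and harmless.",
    "Prefer responses that don't help with illegal activities.",
]

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def critique_and_revise(user_prompt: str) -> str:
    """Phase 1 (supervised): generate, critique against each principle, revise."""
    response = complete(user_prompt)
    for principle in CONSTITUTION:
        critique = complete(
            f"Response: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        response = complete(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response  # becomes a fine-tuning target

def ai_preference(user_prompt: str, resp_a: str, resp_b: str) -> str:
    """Phase 2 (RLAIF): an AI labeler picks the response that better
    satisfies the constitution; the choice is the preference signal for RL."""
    verdict = complete(
        f"Principles: {CONSTITUTION}\nPrompt: {user_prompt}\n"
        f"(A) {resp_a}\n(B) {resp_b}\n"
        "Which response better follows the principles? Answer A or B."
    )
    return resp_a if "A" in verdict else resp_b  # crude parse, illustrative only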

The Constitution

The constitution is a set of natural-language principles: "Choose the response that is most helpful while being honest and harmless," "Prefer responses that don't help with illegal activities," etc. The power of this approach is that principles can be modified, added, or removed without retraining from scratch — you update the constitution and re-run the critique-revision process. This makes alignment criteria explicit, debatable, and improvable.
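Continuing the sketch above, amending the alignment criteria amounts to editing the principle list and regenerating the training data; the added principle and the one-prompt dataset here are illustrative.

```python
# Illustrative only: amend the constitution, then re-run critique-revision
# to produce updated fine-tuning data -- no retraining from scratch.
CONSTITUTION.append(
    "Prefer responses that acknowledge uncertainty rather than guess."
)

training_pairs = [
    (prompt, critique_and_revise(prompt))
    for prompt in ["How do I pick a strong password?"]  # your prompt dataset
]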

Beyond Anthropic

The constitutional approach has influenced the broader alignment field. The idea of using AI feedback (RLAIF) to scale alignment beyond what human labeling can provide is now used by multiple labs. The concept of explicit, auditable alignment criteria — rather than implicit criteria embedded in labeler instructions — is becoming an industry best practice.
