
Constitutional AI

CAI
An alignment technique developed by Anthropic that trains a model to follow a set of principles (a "constitution") rather than relying solely on human feedback for every decision. The model critiques and revises its own outputs against these principles, then is trained on the revised outputs. This reduces the need for human annotators and makes alignment criteria explicit and auditable.

Why It Matters

Constitutional AI addresses two problems with RLHF: cost (every training example requires a human label) and opacity (the criteria are implicit in the labelers' judgments). By making the principles explicit, CAI makes alignment more transparent, scalable, and consistent. It is a core part of how Claude is trained.

Deep Dive

The CAI process has two phases. First, supervised learning: the model generates responses, then a separate instance critiques those responses against the constitutional principles ("Does this response help with harmful activities?"), and revises them. The model is fine-tuned on the revised responses. Second, RL from AI feedback (RLAIF): instead of human preference labels, an AI model compares response pairs against the constitution and provides the preference signal for RL training.
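
As a concrete illustration, here is a minimal Python sketch of both phases. The `Model.generate` interface, the prompt templates, and the answer parsing are all assumptions made for this example, not Anthropic's actual pipeline.

```python
# Minimal sketch of both CAI phases. `Model.generate` is a hypothetical
# text-in/text-out interface, and the prompt templates are illustrative.
import random
from typing import Protocol

class Model(Protocol):
    def generate(self, prompt: str) -> str: ...

CONSTITUTION = [
    "Choose the response that is most helpful while being honest and harmless.",
    "Prefer responses that don't help with illegal activities.",
]

def critique_and_revise(model: Model, prompt: str) -> str:
    """Phase 1 (supervised): one critique-revision pass. The revision
    becomes a fine-tuning target for the supervised learning stage."""
    response = model.generate(prompt)
    principle = random.choice(CONSTITUTION)
    critique = model.generate(
        f"Prompt: {prompt}\nResponse: {response}\n\n"
        f"Critique the response against this principle: {principle}"
    )
    revision = model.generate(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n\n"
        "Rewrite the response so it satisfies the principle."
    )
    return revision

def ai_preference(model: Model, prompt: str, a: str, b: str) -> str:
    """Phase 2 (RLAIF): an AI model, not a human, picks the response that
    better follows the constitution, providing the preference signal for RL."""
    principle = random.choice(CONSTITUTION)
    verdict = model.generate(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\n\n"
        "Which response better follows the principle? Answer A or B."
    )
    return a if verdict.strip().upper().startswith("A") else b
```

Drawing a single principle at random per pass mirrors how the original CAI paper samples principles, rather than applying the whole constitution at once.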

The Constitution

The constitution is a set of natural-language principles: "Choose the response that is most helpful while being honest and harmless," "Prefer responses that don't help with illegal activities," etc. The power of this approach is that principles can be modified, added, or removed without retraining from scratch — you update the constitution and re-run the critique-revision process. This makes alignment criteria explicit, debatable, and improvable.
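
Since the constitution is plain data, revising it is a configuration change rather than a code change. A hypothetical sketch, reusing the illustrative names from the block above:

```python
# Updating the constitution and regenerating training data under it.
# `CONSTITUTION`, `critique_and_revise`, and `Model` are the illustrative
# names from the earlier sketch, not a real API.

def update_constitution(remove: list[str], add: list[str]) -> None:
    """Edit principles in place; no retraining from scratch is needed,
    only a fresh critique-revision pass over the training prompts."""
    for principle in remove:
        CONSTITUTION.remove(principle)
    CONSTITUTION.extend(add)

def rebuild_sft_data(model: Model, prompts: list[str]) -> list[tuple[str, str]]:
    """Regenerate (prompt, revised response) pairs under the new constitution."""
    return [(p, critique_and_revise(model, p)) for p in prompts]
```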

Beyond Anthropic

The constitutional approach has influenced the broader alignment field. The idea of using AI feedback (RLAIF) to scale alignment beyond what human labeling can provide is now used by multiple labs. The concept of explicit, auditable alignment criteria — rather than implicit criteria embedded in labeler instructions — is becoming an industry best practice.
