Safety

Constitutional AI

CAI
An alignment technique developed by Anthropic in which a model is trained to follow a set of principles (a “constitution”) instead of depending solely on human feedback for every decision. The model critiques and revises its own outputs against these principles, then trains on the revised outputs. This reduces the need for human labelers and makes the alignment criteria explicit and auditable.

Why it matters

Constitutional AI addresses two problems with RLHF: it is expensive (human labelers are needed for every training example) and opaque (the criteria are implicit in labeler judgments). By making the principles explicit, CAI makes alignment more transparent, scalable, and consistent. It is a core part of how Claude is trained.

Deep Dive

The CAI process has two phases. The first is supervised learning: the model generates responses, the same model is then prompted to critique them against the constitutional principles (for example, "Does this response help with harmful activities?"), and it revises them in light of the critique. The model is fine-tuned on the revised responses. The second is RL from AI feedback (RLAIF): instead of human preference labels, an AI model compares pairs of responses against the constitution, and those comparisons supply the preference signal for RL training.
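The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Model` is a hypothetical stand-in for any text-in, text-out language model call, `toy_model` is a canned stub so the example runs without a real model, and the prompt wording and the choice to sample one principle per example are assumptions made for the sketch, not Anthropic's actual prompts or pipeline.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt in, text out.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model, prompt: str, principles: List[str], rng: random.Random
) -> Tuple[str, str]:
    """Phase 1 sketch: draft, critique against one principle, revise."""
    draft = model(prompt)
    principle = rng.choice(principles)  # assumption: one sampled principle per example
    critique = model(f"Critique this response against the principle '{principle}':\n{draft}")
    revision = model(
        f"Rewrite the response to address the critique.\nResponse: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pair would become a supervised fine-tuning example.
    return prompt, revision


def ai_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> int:
    """Phase 2 sketch: an AI judge picks the response that better follows
    the principle. Returns 0 if A is preferred, 1 if B is preferred."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which is better? Answer A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def toy_model(text: str) -> str:
    """Canned stub used only to make the sketch runnable."""
    if text.startswith("Critique"):
        return "The response gives operational detail for a harmful activity."
    if text.startswith("Rewrite"):
        return "Revised: I can't help with that, but here is a safe alternative."
    if "Which is better?" in text:
        return "B"
    return "Draft: here is how to do it."


principles = ["Choose the response that is most helpful while being honest and harmless."]
example = critique_and_revise(toy_model, "How do I pick a lock?", principles, random.Random(0))
label = ai_preference(toy_model, "How do I pick a lock?", "Draft A", "Draft B", principles[0])
```

In a real system the collected `(prompt, revision)` pairs would feed a fine-tuning run, and the preference labels would feed the RL stage. Both of those training steps are outside the scope of this sketch.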

The Constitution

The constitution is a set of natural-language principles: "Choose the response that is most helpful while being honest and harmless," "Prefer responses that don't help with illegal activities," and so on. The power of this approach is that principles can be modified, added, or removed without retraining from scratch: you update the constitution and re-run the critique-revision process. This makes alignment criteria explicit, debatable, and improvable.
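One way to picture this is to treat the constitution as plain data that the pipeline reads. The sketch below assumes a hypothetical helper, `build_critique_prompts`, and an invented prompt layout. The third principle is made up for illustration and is not taken from any published constitution.

```python
from typing import List


def build_critique_prompts(constitution: List[str], response: str) -> List[str]:
    """Return one critique request per principle in the constitution."""
    return [
        f"Principle: {principle}\nResponse: {response}\n"
        "Does the response follow the principle? Explain."
        for principle in constitution
    ]


constitution = [
    "Choose the response that is most helpful while being honest and harmless.",
    "Prefer responses that don't help with illegal activities.",
]
prompts_v1 = build_critique_prompts(constitution, "Sure, here is how...")

# Editing the constitution is a data change, not a code change.
# The added principle is hypothetical.
constitution_v2 = constitution + ["Prefer responses that admit uncertainty."]
prompts_v2 = build_critique_prompts(constitution_v2, "Sure, here is how...")
```

The pipeline code is identical for both versions. Only the list of principles differs, which is what makes the criteria easy to inspect and revise.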

Beyond Anthropic

The constitutional approach has influenced the broader alignment field. Using AI feedback (RLAIF) to scale alignment beyond what human labeling can provide is now practiced by multiple labs. Explicit, auditable alignment criteria, as opposed to implicit criteria embedded in labeler instructions, are increasingly treated as good practice across the industry.
