Safety

Constitutional AI

CAI
An alignment technique developed by Anthropic in which a model is trained to follow a set of written principles (a "constitution") rather than relying solely on human feedback for every decision. The model critiques and revises its own outputs against these principles and is then trained on the revised outputs. This reduces the need for human labelers and makes the alignment criteria explicit and auditable.

Why it matters

Constitutional AI addresses two problems with RLHF: it is expensive (human labelers must compare or rate large numbers of model outputs) and opaque (the criteria are implicit in labeler judgments). By stating the principles explicitly, CAI makes alignment more transparent, scalable, and consistent. It is a core part of how Claude is trained.

Deep Dive

The CAI process has two phases. First, supervised learning: the model generates responses, is then prompted to critique them against the constitutional principles (for example, "Does this response help with harmful activities?"), and revises them accordingly. The model is fine-tuned on the revised responses. Second, RL from AI feedback (RLAIF): instead of human preference labels, an AI model compares pairs of responses against the constitution, and those comparisons provide the preference signal for RL training.
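The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the function names, the prompt wording, the choice to sample one principle at random per example, and the `toy_model` stub that stands in for a real language model are all assumptions made for the sketch. A single callable plays every role here (drafting, critiquing, revising, judging) purely to keep the example short.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: text in, text out.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model, prompt: str, principles: List[str], rng: random.Random
) -> Tuple[str, str]:
    """Phase 1: draft a response, critique it against one principle, revise it.

    Returns a (prompt, revised_response) pair that could serve as one
    supervised fine-tuning example.
    """
    draft = model(prompt)
    principle = rng.choice(principles)  # assumption: one principle per example
    critique = model(
        f"Critique the response against the principle.\n"
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}"
    )
    revised = model(
        f"Rewrite the response to address the critique.\n"
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    return prompt, revised


def ai_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> int:
    """Phase 2: ask an AI judge which response better follows the principle.

    Returns 0 if response A is preferred and 1 otherwise. In a full pipeline
    these labels would replace human preference labels in RL training.
    """
    verdict = judge(
        f"Compare the two responses under the principle. Answer A or B.\n"
        f"Principle: {principle}\nPrompt: {prompt}\nA: {a}\nB: {b}"
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def toy_model(text: str) -> str:
    """Toy stub so the sketch runs without a real model."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "REVISED: a safer answer"
    if text.startswith("Compare"):
        return "B"
    return "DRAFT: a first answer"
```

With a real model in place of `toy_model`, the pairs returned by `critique_and_revise` would form the fine-tuning set for the first phase, and the labels from `ai_preference` would supply the preference signal for the second.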

The Constitution

The constitution is a set of natural-language principles, such as "Choose the response that is most helpful while being honest and harmless" or "Prefer responses that don't help with illegal activities." The strength of this approach is that principles can be modified, added, or removed without retraining from scratch: you update the constitution and re-run the critique-revision process to produce new training data. This makes alignment criteria explicit, debatable, and improvable.
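Because the constitution is plain text, it can be treated as ordinary data. The sketch below, under that assumption, stores the two example principles quoted above in a list, shows a hypothetical `amend` helper for revising the list, and builds one critique prompt per principle. The helper names, the prompt wording, and the added principle about uncertainty are illustrative inventions, not part of any published constitution.

```python
from typing import Iterable, List

# The two example principles quoted in the text above.
constitution: List[str] = [
    "Choose the response that is most helpful while being honest and harmless.",
    "Prefer responses that don't help with illegal activities.",
]


def amend(
    constitution: List[str],
    add: Iterable[str] = (),
    remove: Iterable[str] = (),
) -> List[str]:
    """Return a new constitution with some principles removed and others added.

    The input list is left unchanged, so earlier versions stay available
    for comparison or audit.
    """
    to_remove = set(remove)
    result = [p for p in constitution if p not in to_remove]
    for p in add:
        if p not in result:  # skip principles that are already present
            result.append(p)
    return result


def build_critique_prompts(
    constitution: List[str], prompt: str, response: str
) -> List[str]:
    """Build one critique prompt per principle for a given prompt/response pair."""
    return [
        f"Principle: {p}\nPrompt: {prompt}\nResponse: {response}\n"
        f"Does the response follow the principle? Explain."
        for p in constitution
    ]


# Hypothetical revision: drop the second principle and add a new one.
revised = amend(
    constitution,
    add=["Prefer responses that acknowledge uncertainty."],
    remove=[constitution[1]],
)
prompts = build_critique_prompts(revised, "example prompt", "example response")
```

Keeping each version of the list immutable is a design choice of this sketch: it makes it easy to diff two constitutions and to re-run the critique-revision step against either one.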

Beyond Anthropic

The constitutional approach has influenced the broader alignment field. Using AI feedback (RLAIF) to scale alignment beyond what human labeling can provide is now practiced by multiple labs. Explicit, auditable alignment criteria, as opposed to implicit criteria embedded in labeler instructions, are increasingly treated as good practice across the industry.
