
Constitutional AI

CAI
An alignment technique developed by Anthropic in which a model is trained to follow a set of principles (a "constitution") rather than relying solely on human feedback for every decision. The model critiques and revises its own outputs against these principles, and is then trained on the revised outputs. This reduces the need for human annotators and makes the alignment criteria explicit and auditable.

Why it matters

Constitutional AI addresses two problems with RLHF: it is expensive (human annotators are needed for every training example) and opaque (the criteria are implicit in the annotators' judgments). By making the principles explicit, CAI makes alignment more transparent, scalable, and consistent. It is a central part of how Claude is trained.

Deep Dive

The CAI process has two phases. First, supervised learning: the model generates responses, then a separate instance critiques those responses against the constitutional principles ("Does this response help with harmful activities?"), and revises them. The model is fine-tuned on the revised responses. Second, RL from AI feedback (RLAIF): instead of human preference labels, an AI model compares response pairs against the constitution and provides the preference signal for RL training.
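The two phases can be sketched in miniature. This is a hypothetical illustration, not Anthropic's actual pipeline: the `generate`, `critique`, and `revise` functions below are toy stand-ins for LLM calls, and the phase-2 judge is a keyword heuristic standing in for an AI preference model.

```python
# Toy sketch of the two CAI phases. All names here are illustrative
# stand-ins for real model calls, not an actual API.

CONSTITUTION = [
    "Choose the response that is most helpful while being honest and harmless.",
    "Prefer responses that don't help with illegal activities.",
]

def generate(prompt):
    # Stand-in for the base model's initial (possibly flawed) response.
    return f"Draft answer to: {prompt}"

def critique(response, principle):
    # Stand-in critique: a real system asks a model whether the
    # response conflicts with the principle.
    return f"Checked against {principle!r}; revise for compliance."

def revise(response, critique_text):
    # Stand-in revision conditioned on the critique.
    return response + " [revised per critique]"

def phase1_supervised(prompts):
    """Phase 1 (SL): generate -> critique -> revise; the revised
    responses become the fine-tuning targets."""
    dataset = []
    for p in prompts:
        r = generate(p)
        for principle in CONSTITUTION:
            r = revise(r, critique(r, principle))
        dataset.append((p, r))
    return dataset

def phase2_rlaif(pair):
    """Phase 2 (RLAIF): an AI judge (here a toy scoring function)
    compares two responses and emits a preference label for RL."""
    a, b = pair
    score = lambda resp: sum(w in resp for w in ("honest", "harmless"))
    return 0 if score(a) >= score(b) else 1

data = phase1_supervised(["How do locks work?"])
pref = phase2_rlaif(("an honest and harmless answer", "an evasive answer"))
```

In a real system both the critic and the judge are themselves language models prompted with the constitutional principles; the sketch only shows where each phase's training signal comes from.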

The Constitution

The constitution is a set of natural-language principles: "Choose the response that is most helpful while being honest and harmless," "Prefer responses that don't help with illegal activities," etc. The power of this approach is that principles can be modified, added, or removed without retraining from scratch — you update the constitution and re-run the critique-revision process. This makes alignment criteria explicit, debatable, and improvable.
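One way to see why this matters operationally: the constitution is plain data, so amending it is an edit followed by a re-run, not a retrain. The sketch below is hypothetical; `violates` is a toy keyword check standing in for an LLM-based judgment.

```python
# Hypothetical sketch: the constitution as editable data. A real system
# would ask a model whether a response conflicts with each principle;
# here `violates` is a toy keyword heuristic.

constitution = [
    "Choose the response that is most helpful while being honest and harmless.",
    "Prefer responses that don't help with illegal activities.",
]

def violates(response, principle):
    # Toy check: only the "illegal activities" principle is enforced,
    # via a crude keyword match.
    if "illegal" in principle:
        return "lockpicking tutorial" in response.lower()
    return False

def prefer(resp_a, resp_b, principles):
    """Return the response with fewer constitutional violations."""
    count = lambda r: sum(violates(r, p) for p in principles)
    return resp_a if count(resp_a) <= count(resp_b) else resp_b

a = "Here is a lockpicking tutorial."
b = "I can explain how pin-tumbler locks work mechanically."

best = prefer(a, b, constitution)

# Amending the constitution is a list edit; then re-run the comparison
# (and, in a real pipeline, the critique-revision process) with no
# retraining from scratch:
constitution.append("Prefer responses that explain mechanisms over instructions.")
best_after = prefer(a, b, constitution)
```

The design point is that the principles live outside the model weights, so they can be inspected, debated, and versioned like any other artifact.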

Beyond Anthropic

The constitutional approach has influenced the broader alignment field. The idea of using AI feedback (RLAIF) to scale alignment beyond what human labeling can provide is now used by multiple labs. The concept of explicit, auditable alignment criteria — rather than implicit criteria embedded in labeler instructions — is becoming an industry best practice.
