Safety

Mechanistic Interpretability

Mech Interp, MI
A research approach that tries to understand what happens inside neural networks at the level of individual neurons, circuits, and features: not just what the model outputs, but how it computes those outputs. The goal is to reverse-engineer the algorithms that neural networks learn, much as you would reverse-engineer compiled software to understand its source code.

Why it matters

If we are going to trust AI with important decisions, we need to understand how it makes them. Mechanistic interpretability is the most rigorous attempt at this: not just asking "what did the model do?" but "what algorithm did it implement, and why?". It is central to AI safety research, particularly at Anthropic, and it has produced real results: researchers have identified circuits for indirect object identification, induction heads, and modular arithmetic inside Transformers.

Deep Dive

The field draws on a key observation: neural networks don't store information in individual neurons (usually). Instead, they use superposition — many features are encoded as directions in activation space, with individual neurons participating in many features simultaneously. A neuron that seems to respond to "the concept of water" might actually respond to a superposition of features related to liquids, transparency, flow, and specific contexts. Disentangling these superposed features is one of the field's central challenges.
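A toy illustration of superposition (a hedged sketch, not drawn from any particular paper): give each of many sparse features a random direction in a smaller activation space. Because only a few features are active at once, each feature can still be read back out approximately, even though every neuron participates in many features. All names and dimensions below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 1000, 64          # far more features than neurons

# Each feature gets a random (nearly orthogonal) direction in activation space.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at a time.
x = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
x[active] = rng.uniform(0.5, 1.5, size=5)

# The activation vector superposes the directions of all active features.
activation = x @ directions               # shape: (n_neurons,)

# Projecting back onto each feature's direction recovers the active features
# approximately, because random high-dimensional directions interfere little.
recovered = directions @ activation       # shape: (n_features,)
print(np.round(recovered[active], 2), "vs", np.round(x[active], 2))
```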

Sparse Autoencoders

One of the most promising recent tools is the sparse autoencoder (SAE). You train an autoencoder to reconstruct a model's internal activations, but with a sparsity constraint that forces it to use only a few features at a time. The learned features often correspond to interpretable concepts — a feature for "code comments," one for "French text," one for "mathematical reasoning." Anthropic published influential work using SAEs to find interpretable features in Claude, identifying millions of features including ones for deception, specific concepts, and language patterns.
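A minimal sketch of the idea in PyTorch, using a plain L1 sparsity penalty (real SAE work differs in architecture details, scale, and training data; the dimensions and coefficients below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs model activations through an overcomplete, sparse feature layer."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

# Toy training loop on random data standing in for recorded activations;
# in practice the inputs are activations captured from a trained model.
d_model, d_features = 128, 1024
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                            # sparsity pressure

for step in range(200):
    acts = torch.randn(256, d_model)
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The reconstruction term keeps the features faithful to the original activations, while the L1 term pushes most feature activations to zero, which is what makes the surviving features easier to interpret.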

From Features to Circuits

Beyond individual features, mechanistic interpretability tries to trace circuits: how does information flow through the network to produce a specific behavior? For example, "induction heads" are two-attention-head circuits that implement in-context learning by pattern-matching: if the model sees "A B ... A" it predicts B. These circuits have been found in models from 2-layer toy Transformers to full-scale LLMs. Understanding circuits at scale remains an open challenge, but progress is accelerating.
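The rule that induction heads implement can be written down directly. This is a toy sketch of the behavior, not of the attention weights that realize it: to predict the next token, look for an earlier occurrence of the current token and copy whatever followed it.

```python
def induction_prediction(tokens):
    """Toy version of the induction-head rule:
    if the sequence contains '... A B ... A', predict B as the next token."""
    current = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]        # copy the token that followed it last time
    return None                         # no earlier match; the rule is silent

# "The cat sat. The cat ..." -> the rule predicts "sat."
print(induction_prediction(["The", "cat", "sat.", "The", "cat"]))
```

In a real Transformer this behavior is split across two attention heads: one attends to the previous token, and the second uses that information to attend to tokens that followed earlier occurrences of the current token.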

Related concepts
