Safety

Mechanistic Interpretability

Mech Interp, MI
A research approach that aims to understand what is happening inside a neural network at the level of individual neurons, circuits, and features: not just what the model outputs, but how it computes those outputs. The goal is to reverse-engineer the algorithms a neural network has learned, much as you would reverse-engineer compiled software to recover its source code.

Why It Matters

If we are going to trust AI with important decisions, we need to understand how it makes them. Mechanistic interpretability is the most rigorous attempt at this: it asks not just "what did the model do?" but "what algorithm does it implement, and why?". It is central to AI safety research, especially at Anthropic, and it is producing real results: researchers have identified circuits inside Transformers for indirect object identification, induction heads, and modular arithmetic.

Deep Dive

The field draws on a key observation: neural networks don't store information in individual neurons (usually). Instead, they use superposition — many features are encoded as directions in activation space, with individual neurons participating in many features simultaneously. A neuron that seems to respond to "the concept of water" might actually respond to a superposition of features related to liquids, transparency, flow, and specific contexts. Disentangling these superposed features is one of the field's central challenges.
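Superposition can be sketched in a few lines of NumPy. This is a toy illustration under assumed dimensions (8 features packed into 4 activation dimensions), not a claim about any real model: because there are more features than dimensions, the feature directions cannot all be orthogonal, so reading out one feature picks up interference from the others.

```python
import numpy as np

# Toy superposition sketch: pack 8 "features" as directions in a
# 4-dimensional activation space (dimensions are illustrative assumptions).
rng = np.random.default_rng(0)
n_features, d_model = 8, 4

# Random unit-norm feature directions. With n_features > d_model they
# cannot be mutually orthogonal, so features must share neurons.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# An activation vector in which only feature 3 is active.
activation = directions[3]

# Naive readout: project the activation onto every feature direction.
readout = directions @ activation

print(np.argmax(readout))                    # feature 3 has the largest projection
print(np.abs(np.delete(readout, 3)).max())   # but interference with others is nonzero
```

The nonzero off-target projections are exactly the "neuron responds to many things" effect described above; with sparse features, models can tolerate this interference, which is why superposition is worth the packing efficiency.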

Sparse Autoencoders

One of the most promising recent tools is the sparse autoencoder (SAE). You train an autoencoder to reconstruct a model's internal activations, but with a sparsity constraint that forces it to use only a few features at a time. The learned features often correspond to interpretable concepts — a feature for "code comments," one for "French text," one for "mathematical reasoning." Anthropic published influential work using SAEs to find interpretable features in Claude, identifying millions of features including ones for deception, specific concepts, and language patterns.
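The core SAE recipe (reconstruct activations through a wide ReLU bottleneck with an L1 sparsity penalty) fits in a short NumPy sketch. Everything here is an illustrative assumption: the synthetic "activations," the sizes, and the hyperparameters are toy choices, not Anthropic's actual training setup.

```python
import numpy as np

# Minimal sparse autoencoder (SAE) sketch. Sizes and data are toy assumptions.
rng = np.random.default_rng(0)
d_act, n_feat, n_samples = 16, 64, 512

# Synthetic "model activations": sparse ground-truth codes mixed into d_act dims.
true_dirs = rng.normal(size=(n_feat, d_act)) / np.sqrt(d_act)
codes = (rng.random((n_samples, n_feat)) < 0.05) * rng.random((n_samples, n_feat))
acts = codes @ true_dirs

# SAE parameters: wide encoder, decoder, encoder bias.
W_enc = rng.normal(size=(d_act, n_feat)) * 0.1
W_dec = rng.normal(size=(n_feat, d_act)) * 0.1
b_enc = np.zeros(n_feat)

lam, lr = 1e-3, 0.1  # L1 sparsity coefficient and learning rate

def step(x):
    """One full-batch gradient step on reconstruction MSE + L1 sparsity."""
    global W_enc, W_dec, b_enc
    pre = x @ W_enc + b_enc
    f = np.maximum(pre, 0.0)      # sparse feature activations
    x_hat = f @ W_dec             # reconstruction of the input activations
    err = x_hat - x
    loss = (err ** 2).mean() + lam * np.abs(f).mean()
    # Manual backpropagation through the two linear maps and the ReLU.
    g_xhat = 2 * err / err.size
    g_Wdec = f.T @ g_xhat
    g_f = g_xhat @ W_dec.T + lam * (f > 0) / f.size
    g_pre = g_f * (pre > 0)
    W_enc -= lr * (x.T @ g_pre)
    b_enc -= lr * g_pre.sum(axis=0)
    W_dec -= lr * g_Wdec
    return loss

losses = [step(acts) for _ in range(200)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The sparsity penalty is what makes the learned features candidates for interpretation: each input activation is explained by only a few active features, so individual features can be inspected one at a time.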

From Features to Circuits

Beyond individual features, mechanistic interpretability tries to trace circuits: how does information flow through the network to produce a specific behavior? For example, "induction heads" are two-attention-head circuits that implement in-context learning by pattern-matching: if the model sees "A B ... A" it predicts B. These circuits have been found in models from 2-layer toy Transformers to full-scale LLMs. Understanding circuits at scale remains an open challenge, but progress is accelerating.
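The algorithm induction heads implement can be written directly in plain Python. This is a sketch of the behavior described above (match the current token against earlier positions and copy what followed), not the attention-head mechanics themselves:

```python
# Induction-head behavior as plain Python: find an earlier occurrence of the
# current token and predict the token that followed it ("A B ... A" -> "B").
def induction_predict(tokens):
    """Predict the next token by prefix matching; None if no earlier match."""
    current = tokens[-1]
    # Scan backwards for the most recent previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed last time
    return None

print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

In a real Transformer this is split across two attention heads: one attends to the previous token to build "token that follows X" information, and the second attends from the current token to positions where X appeared before.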
