
Mechanistic Interpretability

Also known as: Mech Interp, MI
A research approach that tries to understand what happens inside a neural network at the level of individual neurons, circuits, and features: not just what the model outputs, but how it computes those outputs. The goal is to reverse-engineer the algorithms a neural network has learned, much as you would reverse-engineer compiled software to recover its source code.

Why It Matters

If we are going to trust AI with important decisions, we need to understand how it makes them. Mechanistic interpretability is the most rigorous attempt at this: it asks not only "what did the model do?" but "what algorithm did it implement, and why?". It is central to AI safety research, especially at Anthropic, and it is producing real results: researchers have identified circuits inside Transformers for indirect object identification, induction heads, and modular arithmetic.

Deep Dive

The field draws on a key observation: neural networks usually don't store information in individual neurons. Instead, they use superposition: many features are encoded as directions in activation space, with individual neurons participating in many features simultaneously. A neuron that seems to respond to "the concept of water" might actually respond to a superposition of features related to liquids, transparency, flow, and specific contexts. Disentangling these superposed features is one of the field's central challenges.
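A toy sketch of superposition, using hypothetical numbers and random feature directions (not taken from any real model): more features than dimensions are packed into one activation vector, and reading a feature back out by projection picks up small "interference" from the others because the directions cannot all be orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 4, 8          # more features than dimensions
# each feature is a (random, roughly orthogonal) direction in activation space
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# a sparse set of active features sums into a single activation vector
active = {2: 1.0, 5: 0.5}           # feature index -> strength
activation = sum(s * directions[i] for i, s in active.items())

# reading a feature out: project the activation onto its direction;
# active features read out strongly, inactive ones show small non-zero
# interference because the 8 directions can't be orthogonal in 4 dims
for i in range(n_features):
    print(i, round(float(activation @ directions[i]), 3))
```

The interference terms are exactly why neuron-level (or direction-level) readouts look "polysemantic", and why disentangling superposed features requires more than inspecting single neurons.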

Sparse Autoencoders

One of the most promising recent tools is the sparse autoencoder (SAE). You train an autoencoder to reconstruct a model's internal activations, but with a sparsity constraint that forces it to use only a few features at a time. The learned features often correspond to interpretable concepts — a feature for "code comments," one for "French text," one for "mathematical reasoning." Anthropic published influential work using SAEs to find interpretable features in Claude, identifying millions of features including ones for deception, specific concepts, and language patterns.
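A minimal sketch of the SAE setup described above: a ReLU encoder into an overcomplete feature dictionary, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The weights here are random stand-ins (a real SAE learns them by gradient descent on model activations), and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_latents = 16, 64         # overcomplete dictionary of features
# random stand-in weights; a real SAE trains these on model activations
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1

def sae_forward(x, l1_coeff=1e-3):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)      # ReLU keeps features non-negative
    x_hat = f @ W_dec                           # linear reconstruction
    # training objective: reconstruction error + L1 penalty pushing f sparse
    loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
    return f, x_hat, loss

x = rng.normal(size=d_model)        # stand-in for a residual-stream activation
features, recon, loss = sae_forward(x)
print("active features:", int((features > 0).sum()), "of", n_latents)
```

The L1 term is what forces only a few features to fire per input; after training, each surviving feature direction is a candidate for an interpretable concept like "code comments" or "French text".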

From Features to Circuits

Beyond individual features, mechanistic interpretability tries to trace circuits: how does information flow through the network to produce a specific behavior? For example, "induction heads" are two-attention-head circuits that implement in-context learning by pattern-matching: if the model sees "A B ... A" it predicts B. These circuits have been found in models from 2-layer toy Transformers to full-scale LLMs. Understanding circuits at scale remains an open challenge, but progress is accelerating.
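The induction-head behavior above can be written out as a tiny pattern-matching function. This is not how attention heads are implemented, only the algorithm the two-head circuit ends up computing: look back for a previous occurrence of the current token and predict whatever followed it.

```python
def induction_predict(tokens):
    """The algorithm an induction-head circuit implements: find the most
    recent earlier occurrence of the current token, and predict the token
    that followed it ("A B ... A" -> predict "B")."""
    current = tokens[-1]
    # scan earlier positions, most recent first, for a matching token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]    # copy forward the token that followed
    return None                     # no earlier occurrence: no prediction

print(induction_predict(["A", "B", "C", "A"]))  # -> "B"
```

In a real Transformer this is split across two heads: a "previous token" head that moves each token's identity one position forward, and the induction head proper, which attends from the current token to positions whose previous token matches it.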
