
Sparse Autoencoder

SAE
A neural network trained to reconstruct a model's internal activations through a sparsity-constrained bottleneck: only a few features may be active at a time. The learned features often correspond to interpretable concepts (specific topics, linguistic patterns, reasoning strategies), making SAEs the primary tool for disentangling the superposed features inside large language models.

Why It Matters

Sparse autoencoders are the microscope of mechanistic interpretability. LLMs pack thousands of features into each layer through superposition, which makes individual neurons uninterpretable. SAEs decompose these superposed representations into separate, interpretable features. Anthropic used SAEs to identify millions of features in Claude, including features for deception, specific concepts, and safety-relevant behaviors.

Deep Dive

Architecture: the SAE takes a model's activation vector (dimension d_model, e.g., 4096) and encodes it into a much larger sparse representation (e.g., 64K features, of which only ~100 are active for any given input). It then decodes back to d_model and is trained to minimize reconstruction error. The sparsity constraint (L1 penalty on the hidden layer) forces the SAE to use only a few features per input, ensuring each feature is specific rather than diffuse.
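A minimal sketch of this setup in PyTorch (the class name, dimensions, and l1_coeff value are illustrative rather than taken from any particular implementation; production SAEs typically add a pre-encoder bias and normalize decoder columns):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode d_model activations into a much wider, mostly-zero feature space."""
    def __init__(self, d_model: int = 4096, n_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # pushes most of them to exactly zero.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus the sparsity penalty on the hidden layer.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```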

What SAEs Find

When trained on LLM activations, SAEs discover interpretable features: a "Golden Gate Bridge" feature that activates on text about the bridge, a "Python code" feature, a "French language" feature, a "sycophantic agreement" feature, and so on. These features are more interpretable than individual neurons because the sparsity constraint separates overlapping concepts that neurons represent in superposition. Anthropic's research found features ranging from concrete (specific entities) to abstract (deception, uncertainty).
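One common way to label a feature is to scan a corpus for the inputs on which it fires most strongly. A sketch, assuming the SparseAutoencoder above and a hypothetical get_activations hook that returns per-token activations from the model:

```python
import heapq
import torch

def top_activating_texts(texts, sae, get_activations, feature_idx, k=10):
    """Return the k texts on which feature `feature_idx` fires most strongly."""
    best = []  # min-heap of (max activation in text, text)
    for text in texts:
        acts = get_activations(text)           # shape: (n_tokens, d_model)
        feats = torch.relu(sae.encoder(acts))  # shape: (n_tokens, n_features)
        score = feats[:, feature_idx].max().item()
        heapq.heappush(best, (score, text))
        if len(best) > k:
            heapq.heappop(best)  # drop the weakest candidate
    return sorted(best, reverse=True)
```

If the top texts share an obvious theme (say, mentions of the Golden Gate Bridge), the feature earns that label.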

Applications Beyond Interpretation

SAE features can be used for more than understanding: clamping a feature to zero suppresses the corresponding behavior (deactivating a "deception" feature), while amplifying a feature strengthens it. This opens the possibility of fine-grained behavioral control without retraining. However, the technique is still experimental — feature interactions are complex, and modifying one feature can have unintended effects on others due to residual superposition.
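A sketch of what clamping might look like, again assuming the SparseAutoencoder above; in practice the edited reconstruction is substituted back into the model's forward pass via a hook:

```python
import torch

def clamp_feature(activation, sae, feature_idx, value=0.0):
    # Encode into the sparse feature space, override one feature,
    # and decode back into the model's activation space.
    f = torch.relu(sae.encoder(activation))
    f[..., feature_idx] = value  # 0.0 suppresses; a large value amplifies
    return sae.decoder(f)
```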
