
Sparse Autoencoder

SAE
A neural network trained to reconstruct a model's internal activations through a bottleneck with a sparsity constraint, so that only a handful of features are active at a time. The learned features often correspond to interpretable concepts (specific topics, linguistic patterns, reasoning strategies), which makes SAEs a primary tool for disentangling the superposed features inside large language models.

Why It Matters

Sparse autoencoders are the microscope of mechanistic interpretability. Through superposition, an LLM packs thousands of features into every layer, which makes individual neurons uninterpretable. SAEs decompose these superposed representations into separate, interpretable features. Anthropic has used SAEs to identify millions of features in Claude, including features for deception, for specific concepts, and for safety-relevant behaviors.

Deep Dive

Architecture: the SAE takes a model's activation vector (dimension d_model, e.g., 4096) and encodes it into a much larger sparse representation (e.g., 64K features, of which only ~100 are active for any given input). It then decodes back to d_model and is trained to minimize reconstruction error. The sparsity constraint (L1 penalty on the hidden layer) forces the SAE to use only a few features per input, ensuring each feature is specific rather than diffuse.
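
A minimal PyTorch sketch of this setup (the layer sizes, the L1 coefficient, and the toy training step are illustrative assumptions, not values from any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Reconstructs d_model activations through a wide, sparse feature layer."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; with the L1 penalty,
        # most of them are driven to exactly zero.
        f = F.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature layer.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

# Toy usage with small dimensions so it runs anywhere.
sae = SparseAutoencoder(d_model=512, n_features=8192)
x = torch.randn(8, 512)          # a batch of 8 cached activations
x_hat, f = sae(x)
sae_loss(x, x_hat, f).backward()
```

In practice the encoder would be far wider than the residual stream it reads from (e.g., 4096 in, 64K out, as above), with the same trade-off: a wider feature layer and a stronger sparsity penalty push toward more specific, less diffuse features.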

What SAEs Find

When trained on LLM activations, SAEs discover interpretable features: a "Golden Gate Bridge" feature that activates on text about the bridge, a "Python code" feature, a "French language" feature, a "sycophantic agreement" feature, and so on. These features are more interpretable than individual neurons because the sparsity constraint separates overlapping concepts that neurons represent in superposition. Anthropic's research found features ranging from concrete (specific entities) to abstract (deception, uncertainty).
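
One common way to check what a feature means is to inspect the inputs that activate it most strongly. A minimal sketch, reusing the SparseAutoencoder from above; the helper name and its inputs are hypothetical:

```python
import torch

def top_activating_examples(sae, activations, texts, feature_idx, k=5):
    """Rank text snippets by how strongly they fire one SAE feature.

    `activations` is an (n, d_model) tensor of cached model activations,
    one row per entry in `texts`; both are hypothetical inputs here.
    """
    with torch.no_grad():
        _, f = sae(activations)               # (n, n_features)
    scores = f[:, feature_idx]                # the chosen feature's activations
    top = torch.topk(scores, k=min(k, len(texts)))
    return [(texts[i], scores[i].item()) for i in top.indices.tolist()]
```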

Applications Beyond Interpretation

SAE features can be used for more than understanding: clamping a feature to zero suppresses the corresponding behavior (deactivating a "deception" feature), while amplifying a feature strengthens it. This opens the possibility of fine-grained behavioral control without retraining. However, the technique is still experimental — feature interactions are complex, and modifying one feature can have unintended effects on others due to residual superposition.
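
A minimal sketch of feature clamping, again reusing the SparseAutoencoder from above; the feature index and clamp values are illustrative assumptions:

```python
import torch

def decode_with_clamped_feature(sae, x, feature_idx, value=0.0):
    """Re-encode an activation, pin one feature, and decode back.

    value=0.0 suppresses the feature; a large positive value amplifies it.
    Patching the result back into the model's forward pass (not shown)
    is what actually steers behavior.
    """
    with torch.no_grad():
        f = torch.relu(sae.encoder(x))
        f[:, feature_idx] = value             # clamp the chosen feature
        return sae.decoder(f)
```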
