Sparse Autoencoder: Definition & Meaning — AI Wiki

A neural network trained to reconstruct a model's internal activations through a bottleneck with a sparsity constraint, so that only a few features can be active at once. The learned features often correspond to interpretable concepts (specific topics, linguistic patterns, reasoning strategies), making SAEs the main tool for disentangling superposed features inside large language models.

Why It Matters

Sparse autoencoders are the microscope of mechanistic interpretability. LLMs pack thousands of features into each layer via superposition, which makes individual neurons uninterpretable. SAEs decompose these superposed representations into individual, interpretable features. Anthropic has used SAEs to identify millions of features in Claude, including features for deception, specific concepts, and safety-relevant behaviors.

Deep Dive

Architecture: the SAE takes a model's activation vector (dimension d_model, e.g., 4096) and encodes it into a much larger sparse representation (e.g., 64K features, of which only ~100 are active for any given input). It then decodes back to d_model and is trained to minimize reconstruction error. The sparsity constraint (an L1 penalty on the hidden-layer activations) pushes the SAE to use only a few features per input, which encourages each feature to be specific rather than diffuse.
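The encode/decode step and training objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular published implementation: the ReLU encoder, the parameter names (W_enc, b_enc, W_dec, b_dec), the l1_coeff value, and the toy dimensions are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # toy size; the text's example uses 4096
n_features = 64   # toy size; the text's example uses ~64K

# Hypothetical parameter names, randomly initialised for illustration.
W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map activations to non-negative feature activations (ReLU is an assumption)."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def decode(f):
    """Map feature activations back to the model's activation space."""
    return f @ W_dec + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the hidden activations."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity, f, x_hat


# A batch of 8 stand-in activation vectors (random here, not real LLM activations).
x = rng.normal(size=(8, d_model))
loss, f, x_hat = sae_loss(x)
```

With untrained random weights the features are not yet sparse; it is minimizing this loss over many real activations that drives most entries of f to zero.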

What SAEs Find

When trained on LLM activations, SAEs discover interpretable features: a "Golden Gate Bridge" feature that activates on text about the bridge, a "Python code" feature, a "French language" feature, a "sycophantic agreement" feature, and so on. These features are more interpretable than individual neurons because the sparsity constraint separates overlapping concepts that neurons represent in superposition. Anthropic's research found features ranging from concrete (specific entities) to abstract (deception, uncertainty).

Applications Beyond Interpretation

SAE features can be used for more than understanding: clamping a feature to zero suppresses the corresponding behavior (for example, deactivating a "deception" feature), while amplifying a feature strengthens it. This opens the possibility of fine-grained behavioral control without retraining. However, the technique is still experimental: feature interactions are complex, and modifying one feature can have unintended effects on others due to residual superposition.
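Clamping and amplifying can be sketched as editing one entry of the feature vector and decoding the result. The snippet below is a toy illustration with random weights, not a real trained SAE; the function name steer and the choice to add the SAE's reconstruction error back (so that only the chosen feature's contribution changes) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64

# Hypothetical, randomly initialised SAE weights (biases omitted for brevity).
W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
W_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))


def steer(x, feature_idx, value):
    """Return x with one SAE feature clamped to `value`.

    The part of x the SAE fails to reconstruct is added back, so the
    edit changes only the chosen feature's contribution.
    """
    f = np.maximum(0.0, x @ W_enc)   # feature activations
    error = x - f @ W_dec            # reconstruction residual
    f_mod = f.copy()
    f_mod[..., feature_idx] = value  # clamp (0.0) or amplify (large value)
    return f_mod @ W_dec + error


x = rng.normal(size=d_model)         # stand-in activation vector
ablated = steer(x, 3, 0.0)           # suppress feature 3
amplified = steer(x, 3, 10.0)        # strengthen feature 3
```

In this sketch the edited activation differs from the original by exactly (value - original activation) times the feature's decoder row. In a real model the edited vector would be fed back into the forward pass, where the downstream side effects mentioned above can appear.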
