Basics

Superposition

Feature Superposition, Polysemanticity
The phenomenon in which neural networks encode more features (concepts, patterns) than they have neurons, by representing features as directions in activation space rather than assigning each feature to a single dedicated neuron. A single neuron participates in encoding dozens of features at once, and each feature is spread across many neurons.

Why It Matters

Superposition is a core reason neural networks are hard to interpret and mechanistic interpretability is challenging. If each neuron represented one concept (like "the concept of a dog"), interpretation would be straightforward. Instead, concepts are smeared across neurons in overlapping patterns. Understanding superposition is key to understanding how neural networks compress information, and why they sometimes behave unexpectedly.

Deep Dive

The key insight: a model with 4096 neurons per layer can represent far more than 4096 features by using the full 4096-dimensional space. Each feature is a direction (a vector) in this space, and features can overlap as long as they're not too similar. This is mathematically analogous to compressed sensing — you can store more signals than dimensions if the signals are sparse (only a few are active at any time).
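
A minimal sketch of this geometry (all sizes and the random seed here are arbitrary illustrations, assuming features are random, nearly orthogonal unit directions): pack 40 features into a 16-dimensional space, activate a sparse subset, and read each feature back by projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 40, 16   # more features than dimensions

# Each feature is a random unit-norm direction in the activation space.
W = rng.standard_normal((n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only 2 of the 40 features are active.
x = np.zeros(n_features)
x[[3, 11]] = 1.0

# Encode: superpose the active feature directions into one activation vector.
h = x @ W                      # shape (16,)

# Decode: project the activation back onto every feature direction.
x_hat = W @ h                  # shape (40,)

# The active features typically dominate the readout; the rest is interference.
print("active readouts:", x_hat[[3, 11]])
print("largest inactive readout:", np.delete(x_hat, [3, 11]).max())
```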

Why Models Do This

Models learn superposition because the world has more features than any practical model has dimensions. A model needs to represent thousands of concepts (colors, emotions, syntax rules, factual knowledge, code patterns), but might only have 4096 dimensions per layer. Superposition lets it pack all these features into the available space, at the cost of some interference when multiple overlapping features activate simultaneously.
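
The interference cost can be made concrete with the same toy setup (again an assumption-laden sketch, not a measurement from any real model): the more features fire at once, the noisier every other feature's readout becomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_dims = 40, 16
W = rng.standard_normal((n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

for k in (1, 4, 16):                        # number of simultaneously active features
    active = rng.choice(n_features, size=k, replace=False)
    h = W[active].sum(axis=0)               # superposed activation vector
    readout = W @ h
    inactive = np.setdiff1d(np.arange(n_features), active)
    # Mean interference on inactive features grows as more features co-activate.
    print(k, np.abs(readout[inactive]).mean())
```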

Implications for Safety

Superposition has direct implications for AI safety. If a "deception" feature is superimposed with other benign features, it's hard to detect and remove. Sparse autoencoders (used in mechanistic interpretability) try to disentangle superposition by finding the individual feature directions, but the number of features in a large model may be enormous — Anthropic identified millions of interpretable features in Claude. Understanding and controlling superposition is a central challenge for making AI systems reliably safe.
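
For intuition only, here is a minimal sparse-autoencoder sketch in the spirit of that approach (the sizes, hyperparameters, and random stand-in data are assumptions, not Anthropic's actual setup): an overcomplete dictionary is trained with a reconstruction loss plus an L1 sparsity penalty, so each activation is explained by a few learned feature directions.

```python
import torch
import torch.nn as nn

d_model, d_dict = 64, 512       # dictionary much wider than the model (assumed sizes)

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h):
        f = torch.relu(self.enc(h))          # sparse, non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                              # sparsity penalty strength (assumed)

h = torch.randn(1024, d_model)               # stand-in for real model activations
for _ in range(200):
    recon, f = sae(h)
    loss = ((recon - h) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```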
