Basics

Superposition

Feature Superposition, Polysemanticity
The phenomenon where neural networks encode more features (concepts, patterns) than they have neurons, by representing features as directions in activation space rather than dedicating a single neuron to a single feature. An individual neuron participates in encoding dozens of features at once, and each feature is spread across many neurons.

Why It Matters

Superposition is why neural networks are hard to interpret and why mechanistic interpretability is challenging. If each neuron represented a single concept (like "the concept of a dog"), interpretation would be straightforward. Instead, concepts are smeared across neurons in overlapping patterns. Understanding superposition is key to understanding how neural networks compress information, and why they sometimes behave unexpectedly.

Deep Dive

The key insight: a model with 4096 neurons per layer can represent far more than 4096 features by using the full 4096-dimensional space. Each feature is a direction (a vector) in this space, and features can overlap as long as they're not too similar. This is mathematically analogous to compressed sensing — you can store more signals than dimensions if the signals are sparse (only a few are active at any time).
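
To make this concrete, here is a minimal numpy sketch (not from the original article; the dimensions and sparsity level are illustrative assumptions). It packs 1024 random feature directions into a 256-dimensional space and shows that a sparse set of active features can still be read back out despite the overlap:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_features = 256, 1024        # 4x more features than dimensions

# Random unit vectors in high dimensions are nearly (not exactly) orthogonal.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

offdiag = W @ W.T - np.eye(n_features)
print("max |cos| between distinct features:", np.abs(offdiag).max())  # small but nonzero

# Sparse activation: only k of the 1024 features are "on" at once.
k = 4
active = rng.choice(n_features, size=k, replace=False)
x = np.zeros(n_features)
x[active] = 1.0

h = W.T @ x                           # superposed representation in 256 dims

# Naive readout: project back onto every feature direction and take the top k.
scores = W @ h
recovered = np.argsort(scores)[-k:]
print("active:", sorted(active), "recovered:", sorted(recovered.tolist()))
```

With only a few features active, the interference terms (the small cosines between distinct directions) stay well below the signal, so recovery typically succeeds; as more features fire at once, interference grows and recovery degrades. That is the compressed-sensing trade-off in miniature.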

Why Models Do This

Models learn superposition because the world has more features than any practical model has dimensions. A model needs to represent thousands of concepts (colors, emotions, syntax rules, factual knowledge, code patterns), but might only have 4096 dimensions per layer. Superposition lets it pack all these features into the available space, at the cost of some interference when multiple overlapping features activate simultaneously.
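
The sketch below, loosely in the spirit of Anthropic's "Toy Models of Superposition" setup (the exact architecture, sparsity level, and hyperparameters here are assumptions), shows how this trade-off can emerge from training: a tiny model with a 4-dimensional hidden space, fed sparse data over 8 features, typically learns to represent more than 4 of them:

```python
import torch

torch.manual_seed(0)
n_features, n_dims = 8, 4
sparsity = 0.9                                  # each feature is off 90% of the time

W = torch.nn.Parameter(torch.randn(n_features, n_dims) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic data: each feature active with probability 0.1, value in [0, 1].
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity)
    x_hat = torch.relu(x @ W @ W.T + b)         # encode to 4 dims, decode back
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After training, more than 4 rows of W tend toward unit norm: the model
# represents more features than it has dimensions, tolerating interference.
norms = W.norm(dim=1)
print("feature direction norms:", norms.detach().round(decimals=2))
print("cosines between feature directions:\n",
      (W @ W.T / (norms[:, None] * norms[None, :])).detach().round(decimals=2))
```

The key lever is sparsity: if the features were densely active, interference would dominate and the model would fall back to representing only as many features as it has dimensions.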

Implications for Safety

Superposition has direct implications for AI safety. If a "deception" feature is superimposed with other benign features, it's hard to detect and remove. Sparse autoencoders (used in mechanistic interpretability) try to disentangle superposition by finding the individual feature directions, but the number of features in a large model may be enormous — Anthropic identified millions of interpretable features in Claude. Understanding and controlling superposition is a central challenge for making AI systems reliably safe.
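
For illustration, here is a hedged sketch of a sparse autoencoder of the kind used in this line of work. The dimensions, hyperparameters, and training data are placeholders; in practice `h` would be activations harvested from a real model's layers:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)    # d_dict >> d_model (overcomplete)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.enc(h))              # sparse feature activations
        h_hat = self.dec(f)                      # reconstruction of the input
        return h_hat, f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                  # sparsity pressure on f

for step in range(100):
    h = torch.randn(256, 512)                    # stand-in for real activations
    h_hat, f = sae(h)
    # Reconstruction loss plus an L1 penalty that pushes most of f to zero,
    # so each dictionary direction ends up capturing one feature.
    loss = ((h - h_hat) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The overcomplete dictionary (4096 directions for 512 dimensions) gives the autoencoder room to assign one direction per underlying feature, undoing the packing that superposition performed.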
