
Mamba

Mamba Architecture
A selective state space model (SSM) architecture designed as an alternative to the Transformer. Created by Albert Gu and Tri Dao, Mamba achieves competitive language modeling performance with linear scaling in sequence length (versus the quadratic cost of Transformer attention). It processes sequences by maintaining a compressed hidden state that is selectively updated: important information is retained while irrelevant information decays.

Why It Matters

Mamba represents the most credible challenge to Transformer dominance. If it (or its descendants) delivers on the promise of linear-time sequence processing at Transformer quality, the implications are substantial: longer context windows, faster inference, lower costs. The "selective" part is the key: unlike earlier SSMs, Mamba makes its state transitions depend on the input, which is what gives it the expressive power to match attention.

Deep Dive

Classical state space models maintain a fixed-size hidden state that is updated at each timestep via learned matrices A (state transition), B (input projection), and C (output projection). Mamba's innovation is making B, C, and the discretization step size Δ input-dependent: the model learns to selectively focus on or ignore different parts of the input based on content, not just position. This selectivity is what earlier SSMs lacked and what prevented them from matching Transformer performance on language tasks.
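
To make the recurrence concrete, here is a minimal sequential sketch of a selective scan in PyTorch. It is illustrative only: the real implementation uses a fused parallel kernel, and the function name `selective_scan` and the shapes below are assumptions chosen for readability, not Mamba's actual API.

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Naive sequential selective SSM scan (illustrative sketch, not Mamba's kernel).

    x:     (L, d)  input sequence
    A:     (d, n)  fixed state-transition parameters (negative for stability)
    B:     (L, n)  input-dependent input projection
    C:     (L, n)  input-dependent output projection
    delta: (L, d)  input-dependent step sizes
    """
    L, d = x.shape
    n = A.shape[1]
    h = torch.zeros(d, n)
    ys = []
    for t in range(L):
        # Discretize per step: A_bar = exp(delta * A), B_bar ~= delta * B (Euler)
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)       # (d, n)
        B_bar = delta[t].unsqueeze(-1) * B[t].unsqueeze(0)  # (d, n)
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)          # selective state update
        ys.append(h @ C[t])                                 # project state to output
    return torch.stack(ys)                                  # (L, d)

# Toy usage: in Mamba, B, C, and delta come from linear projections of x.
L, d, n = 16, 8, 4
x = torch.randn(L, d)
A = -torch.rand(d, n)                                       # decay-inducing transition
B, C = torch.randn(L, n), torch.randn(L, n)
delta = torch.nn.functional.softplus(torch.randn(L, d))     # positive step sizes
y = selective_scan(x, A, B, C, delta)
```

Because delta and the projections vary per timestep, the state update differs by content at every step, which is exactly what the fixed-matrix SSMs before Mamba could not do.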

The Hardware Story

Mamba's other contribution is a hardware-aware implementation. The selective scan operation is rewritten to minimize memory transfers between GPU HBM and SRAM, using kernel fusion and recomputation to avoid materializing the full state expansion in memory. This engineering makes the theoretical linear complexity translate to actual wall-clock speedups, not just asymptotic improvements that get eaten by constant factors.
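
Mamba's fused CUDA kernel is beyond a short sketch, but the recomputation idea it relies on can be illustrated with PyTorch's gradient checkpointing, which trades compute for memory in the same spirit by not storing intermediates for the backward pass. This shows the general principle only, not Mamba's actual kernel:

```python
import torch
from torch.utils.checkpoint import checkpoint

def scan_segment(x):
    # Stand-in for a chunk of work whose intermediates we don't want to keep.
    return torch.tanh(x @ x.T) @ x

x = torch.randn(64, 64, requires_grad=True)
# Intermediates inside scan_segment are recomputed during backward
# instead of being materialized in memory during forward.
y = checkpoint(scan_segment, x, use_reentrant=False)
y.sum().backward()
```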

Mamba-2 and Hybrids

Mamba-2 simplified the architecture by showing that the selective state space model can be viewed as a structured form of attention, unifying the SSM and Transformer perspectives mathematically. This led to hybrid architectures (like Jamba from AI21, Zamba from Zyphra) that interleave Mamba layers with attention layers, getting the efficiency of SSMs for most of the sequence processing while using attention for the tasks where global token interaction is essential. The debate isn't "SSM vs. Transformer" anymore — it's about finding the optimal mix.
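
The duality can be seen in a few lines: for the scalar-decay case used by Mamba-2, the selective recurrence unrolls into a masked, attention-like matrix product. The sketch below assumes per-step scalar decays shared across channels, a simplification of the actual multi-head formulation:

```python
import torch

def ssd_matrix_form(x, a, B, C):
    """Selective SSM written as masked 'attention': y = (M * (C @ B^T)) @ x.

    x: (L, d) inputs; a: (L,) per-step scalar decays in (0, 1);
    B, C: (L, n) input-dependent projections.
    M[t, s] = a_{s+1} * ... * a_t for s <= t (the decay mask).
    """
    cum = torch.cumsum(torch.log(a), dim=0)             # cumulative log-decay
    M = torch.exp(cum.unsqueeze(1) - cum.unsqueeze(0))  # M[t, s] = prod of a over (s, t]
    M = torch.tril(M)                                   # causal mask: only s <= t
    att = (C @ B.T) * M                                 # decay-masked attention scores
    return att @ x

# Equivalent recurrent view: h_t = a_t * h_{t-1} + outer(B_t, x_t); y_t = C_t @ h_t.
L, d, n = 16, 8, 4
x = torch.randn(L, d)
a = torch.sigmoid(torch.randn(L))                       # decays in (0, 1)
B, C = torch.randn(L, n), torch.randn(L, n)
y = ssd_matrix_form(x, a, B, C)
```

The (L, L) matrix `att` is exactly an attention map whose scores are modulated by a causal decay mask, which is why the two perspectives can be unified.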
