
RNN

Recurrent Neural Network, LSTM, GRU
A neural network that processes sequences by maintaining a hidden state updated at every step, so it "remembers" what it has seen so far. LSTM and GRU are improved variants that address the vanilla RNN's tendency to forget long-range dependencies. RNNs dominated NLP and speech recognition until Transformers displaced them around 2018–2020.

Why It Matters

RNNs are the ancestors of modern language models. Understanding why they failed (slow sequential processing, difficulty with long-range dependencies) explains why Transformers succeeded (parallel processing, attention across all positions). SSM/Mamba architectures are, in a sense, a return to RNN ideas with modern fixes.

Deep Dive

An RNN processes a sequence token by token, updating its hidden state at each step: h_t = f(h_{t-1}, x_t). The hidden state is a compressed representation of everything seen so far. The problem: as sequences get longer, the hidden state must compress more and more information into a fixed-size vector, and gradient signals for early tokens vanish during backpropagation (the "vanishing gradient problem").
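A minimal sketch of this recurrence in NumPy may help make it concrete. The weight names, sizes, and random initialization below are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, hidden_size):
    """Run the recurrence over a whole sequence, returning the final hidden state."""
    input_size = xs.shape[1]
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    b_h = np.zeros(hidden_size)
    h = np.zeros(hidden_size)      # fixed-size summary of everything seen so far
    for x_t in xs:                 # strictly sequential: step t needs h from step t-1
        h = rnn_step(h, x_t, W_xh, W_hh, b_h)
    return h

# toy sequence: 10 tokens, each an 8-dimensional embedding
final_h = rnn_forward(np.random.randn(10, 8), hidden_size=16)
print(final_h.shape)  # (16,)
```

The loop makes both problems visible: the whole sequence has to fit into one fixed-size vector, and no step can start before the previous one finishes.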

LSTM and GRU

Long Short-Term Memory (LSTM, 1997) and Gated Recurrent Units (GRU, 2014) addressed vanishing gradients by introducing gates — learned mechanisms that control what information to keep, update, or forget. LSTMs have a separate cell state that can carry information unchanged across many steps, with gates controlling access. GRUs simplify LSTMs by merging the cell and hidden states while maintaining similar performance.
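To make the gating idea concrete, here is a rough NumPy sketch of a GRU cell using the standard formulation; the parameter names and shapes are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, params):
    """One GRU step: learned gates decide how much old state to keep vs. overwrite."""
    W_z, U_z, W_r, U_r, W_h, U_h = params
    z = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate: how much to refresh
    r = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate: how much past to use
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))   # candidate new state
    return (1 - z) * h_prev + z * h_tilde               # blend old state and candidate

# illustrative usage with random weights
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = tuple(
    rng.normal(scale=0.1, size=shape)
    for shape in [(d_h, d_in), (d_h, d_h)] * 3          # (W_z, U_z, W_r, U_r, W_h, U_h)
)
h = gru_step(np.zeros(d_h), rng.normal(size=d_in), params)
print(h.shape)  # (16,)
```

When the update gate z is near 0, the previous state passes through almost unchanged, which is the mechanism that lets information and gradients survive across many steps.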

Why Transformers Won

RNNs process tokens sequentially — token 5 can't be processed until tokens 1–4 are done. This makes them inherently slow on parallel hardware (GPUs). Transformers process all tokens simultaneously using attention, making them dramatically faster to train. Attention also directly connects every token to every other token, solving the long-range dependency problem without relying on a compressed hidden state. The trade-off: Transformers use quadratic memory in sequence length, while RNNs use constant memory. This is why SSMs (Mamba) are interesting — they offer RNN-like efficiency with Transformer-like performance.
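A back-of-the-envelope way to see the memory trade-off (the sequence length and dimensions below are arbitrary, purely for illustration):

```python
import numpy as np

n, d = 1024, 64   # sequence length, per-head dimension

# Attention: every token scores every other token,
# so the attention matrix grows quadratically with sequence length.
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
scores = Q @ K.T / np.sqrt(d)
print(scores.shape)   # (1024, 1024) -> O(n^2)

# RNN: however long the sequence, the running state stays the same size.
h = np.zeros(d)
print(h.shape)        # (64,) -> O(1) in sequence length
```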
