
RNN

Recurrent Neural Network, LSTM, GRU
A neural network that processes sequences by maintaining a hidden state updated at every step, so it "remembers" what it has seen so far. LSTM and GRU are improved variants that address the original RNN's tendency to forget long-range dependencies. RNNs dominated NLP and speech recognition until Transformers displaced them around 2018–2020.

Why It Matters

RNNs are the ancestors of modern language models. Understanding why they failed (slow sequential processing, trouble with long-range dependencies) explains why Transformers succeeded (parallel processing, attention across all positions). SSM/Mamba architectures are, to some extent, a return to RNN ideas with modern fixes.

Deep Dive

An RNN processes a sequence token by token, updating its hidden state at each step: h_t = f(h_{t-1}, x_t). The hidden state is a compressed representation of everything seen so far. The problem: as sequences get longer, the hidden state must compress more and more information into a fixed-size vector, and gradient signals for early tokens vanish during backpropagation (the "vanishing gradient problem").
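As a rough sketch of that recurrence in plain NumPy (names like `rnn_step`, `W_h`, and `W_x` are illustrative, not from any particular library), note that the hidden state `h` stays the same size no matter how long the sequence gets:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

hidden, inputs = 8, 4                        # hidden size, input size
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_x = rng.normal(size=(hidden, inputs)) * 0.1
b = np.zeros(hidden)

h = np.zeros(hidden)                         # fixed-size hidden state
for x_t in rng.normal(size=(20, inputs)):    # a toy sequence of 20 tokens
    h = rnn_step(h, x_t, W_h, W_x, b)        # everything seen so far is compressed into h
```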

LSTM and GRU

Long Short-Term Memory (LSTM, 1997) and Gated Recurrent Units (GRU, 2014) addressed vanishing gradients by introducing gates — learned mechanisms that control what information to keep, update, or forget. LSTMs have a separate cell state that can carry information unchanged across many steps, with gates controlling access. GRUs simplify LSTMs by merging the cell and hidden states while maintaining similar performance.
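A minimal sketch of one LSTM step, assuming a single stacked weight matrix `W` for the four gate pre-activations (illustrative names, not a library API). The key line is the cell update, where a forget gate near 1 lets `c_prev` pass through nearly unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # cell state: carried across steps, gated access
    h = o * np.tanh(c)                            # hidden state: a gated view of the cell
    return h, c

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, hidden + inputs)) * 0.1
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x_t in rng.normal(size=(20, inputs)):
    h, c = lstm_step(h, c, x_t, W, b)
```

A GRU follows the same gating idea but merges the cell and hidden states into one vector with two gates instead of three.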

Why Transformers Won

RNNs process tokens sequentially — token 5 can't be processed until tokens 1–4 are done. This makes them inherently slow on parallel hardware (GPUs). Transformers process all tokens simultaneously using attention, making them dramatically faster to train. Attention also directly connects every token to every other token, solving the long-range dependency problem without relying on a compressed hidden state. The trade-off: Transformers use quadratic memory in sequence length, while RNNs use constant memory. This is why SSMs (Mamba) are interesting — they offer RNN-like efficiency with Transformer-like performance.
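A toy comparison of the two access patterns (single-head attention with no masking or multi-head details; names are illustrative): the RNN must loop step by step over a constant-size state, while attention computes an explicit (n, n) score matrix in one batch of matrix multiplies.

```python
import numpy as np

def rnn_forward(X, W_h, W_x):
    """Sequential: step t needs h from step t-1, so the loop cannot be parallelized."""
    h = np.zeros(W_h.shape[0])
    for x_t in X:                                  # O(n) steps, O(1) state regardless of length
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

def attention_forward(X, W_q, W_k, W_v):
    """Parallel: every token attends to every other token at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (n, n) matrix: quadratic memory in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V

n, d = 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
h_last = rnn_forward(X, W[0], W[1])                # constant-size state, sequential loop
ctx = attention_forward(X, W[2], W[3], W[4])       # one shot, but materializes an n-by-n score matrix
```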
