
RNN

Recurrent Neural Network, LSTM, GRU
A neural network that processes sequences by maintaining a hidden state updated at each step; it "remembers" what it has seen so far. LSTMs and GRUs are improved variants that mitigate the original RNN's tendency to forget long-range dependencies. RNNs dominated NLP and speech before Transformers replaced them around 2018–2020.

Why it matters

RNNs are the ancestors of modern language models. Understanding why they failed (slow sequential processing, difficulty with long-range dependencies) explains why Transformers succeeded (parallel processing, attention over all positions). The SSM/Mamba architecture is, in some ways, a return to the RNN idea with modern fixes.

Deep Dive

An RNN processes a sequence token by token, updating its hidden state at each step: h_t = f(h_{t-1}, x_t). In the vanilla formulation this is h_t = tanh(W_h h_{t-1} + W_x x_t + b). The hidden state is a compressed representation of everything seen so far. The problem: as sequences get longer, the hidden state must compress more and more information into a fixed-size vector, and the gradient signal for early tokens is multiplied by the Jacobian of every intervening step during backpropagation, so it tends to shrink toward zero (the "vanishing gradient problem").
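To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass, assuming a vanilla tanh RNN with hypothetical dimensions. The loop makes the sequential dependency visible: step t cannot start until step t-1 has produced h.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Vanilla RNN forward pass: fold a sequence into one fixed-size vector.

    xs  : (T, d_in)   sequence of input vectors
    W_h : (d_h, d_h)  hidden-to-hidden weights
    W_x : (d_h, d_in) input-to-hidden weights
    b   : (d_h,)      bias
    """
    h = np.zeros(W_h.shape[0])                 # h_0: empty memory
    for x_t in xs:                             # strictly sequential: no parallelism over t
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # h_t = f(h_{t-1}, x_t)
    return h                                   # everything seen so far, squeezed into d_h numbers

# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 50
h_final = rnn_forward(rng.normal(size=(T, d_in)),
                      0.1 * rng.normal(size=(d_h, d_h)),
                      0.1 * rng.normal(size=(d_h, d_in)),
                      np.zeros(d_h))
```

Backpropagating through this loop multiplies gradients by the Jacobian of each step in turn, which is exactly why the signal from early tokens shrinks over long sequences.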

LSTM and GRU

Long Short-Term Memory (LSTM, 1997) and Gated Recurrent Units (GRU, 2014) addressed vanishing gradients by introducing gates — learned mechanisms that control what information to keep, update, or forget. LSTMs have a separate cell state that can carry information unchanged across many steps, with gates controlling access. GRUs simplify LSTMs by merging the cell and hidden states while maintaining similar performance.
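To make "gates" concrete, here is a single GRU step in the same NumPy style. This is a minimal sketch assuming the gate convention used by most modern implementations (conventions vary between papers); the weight names (Wz, Uz, etc.) are illustrative, and real libraries fuse these matrix multiplies for speed.

```python
import numpy as np

def gru_step(h_prev, x_t, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step. z is the update gate, r the reset gate (names illustrative)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # how much of the old state to keep
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # how much old state to use in the candidate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate new state
    # When z is near 1, h_prev passes through almost unchanged, so information
    # (and gradient) can flow across many steps without shrinking.
    return z * h_prev + (1.0 - z) * h_tilde
```

The key design choice is the additive blend in the last line: instead of overwriting the state with a fresh nonlinear transform at every step, the gates let the network copy state forward unchanged, which is what keeps gradients alive over long ranges.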

Why Transformers Won

RNNs process tokens sequentially: token 5 can't be processed until tokens 1–4 are done. This makes them inherently slow to train on parallel hardware (GPUs). Transformers process all tokens simultaneously using attention, making training dramatically faster. Attention also directly connects every token to every other token, solving the long-range dependency problem without relying on a compressed hidden state. The trade-off: attention's compute and memory grow quadratically with sequence length, while an RNN's per-step cost and state size stay constant. This is why SSMs (Mamba) are interesting: they offer RNN-like efficiency with Transformer-like performance.
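For contrast with the RNN loop above, here is a minimal single-head self-attention sketch in the same style; it omits causal masking, multiple heads, and the output projection, and the weight names are hypothetical. All T tokens are handled in a few batched matrix multiplies (the parallelism), and the score matrix is T×T (the quadratic cost).

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal single-head self-attention. X: (T, d); Wq/Wk/Wv: (d, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # all T tokens projected at once, no loop over t
    scores = Q @ K.T / np.sqrt(K.shape[1])      # (T, T): every token scored against every other
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                          # each output mixes all positions directly
```

Compare the two sketches: the RNN reaches token 1 from token 50 only through 49 intermediate states, while attention connects them in a single step, at the price of materializing that (T, T) score matrix.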
