
LSTM

Long Short-Term Memory
A type of recurrent neural network (RNN) designed to learn long-range dependencies in sequential data. LSTM introduces a "cell state", a memory highway that can carry information unchanged across many time steps, controlled by three gates: an input gate (what to add), a forget gate (what to remove), and an output gate (what to expose). Introduced by Hochreiter and Schmidhuber in 1997, LSTM dominated sequence modeling until Transformers emerged.

Why it matters

LSTM was the backbone of NLP for a decade (2010s): machine translation, speech recognition, text generation, and sentiment analysis all ran on LSTMs. Understanding LSTM helps you understand why Transformers replaced it (parallelism and long-range attention vs. sequential processing and compressed state) and why SSMs like Mamba are interesting (they revisit the gated-state idea with modern improvements).

Deep Dive

LSTM's three gates are small sigmoid layers that output a value between 0 (block completely) and 1 (pass through completely) for each element of the state. The forget gate decides which cell-state information to discard. The input gate decides which new information to add. The output gate decides which part of the cell state to expose as the hidden state. This gating mechanism lets the network learn what to remember and what to forget over long sequences, something vanilla RNNs couldn't do.
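
For concreteness, here is a minimal sketch of one LSTM time step in NumPy. The function name lstm_step, the stacked-weight layout, and the toy shapes are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative).
    Shapes: x (n_in,), h_prev/c_prev (n_h,), W (4*n_h, n_in + n_h), b (4*n_h,)."""
    n_h = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b   # all four pre-activations stacked together
    i = sigmoid(z[0*n_h:1*n_h])               # input gate: how much new information to write
    f = sigmoid(z[1*n_h:2*n_h])               # forget gate: how much of c_prev to keep
    o = sigmoid(z[2*n_h:3*n_h])               # output gate: how much of the cell state to expose
    g = np.tanh(z[3*n_h:4*n_h])               # candidate values to add
    c = f * c_prev + i * g                    # cell state: the "memory highway"
    h = o * np.tanh(c)                        # hidden state passed to the next step / layer
    return h, c

# Toy usage: run a 16-unit cell over ten random 8-dimensional inputs.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * 16, 8 + 16))
b = np.zeros(4 * 16)
h, c = np.zeros(16), np.zeros(16)
for x in rng.normal(size=(10, 8)):
    h, c = lstm_step(x, h, c, W, b)
```

Note how the cell state c is updated additively (forget a fraction of the old value, write a fraction of the new), while the hidden state h only exposes a gated view of it.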

Why LSTMs Were Revolutionary

Before LSTM, RNNs suffered from vanishing gradients: the training signal decayed exponentially as it was backpropagated through time, so dependencies between early and late parts of a sequence were effectively unlearnable. LSTM's cell state acts as a gradient highway that can carry gradients nearly unchanged through hundreds of steps (see the derivation below). This is what enabled sequence-to-sequence learning: machine translation (encode the source sentence, decode the target sentence), text summarization, and question answering all became practical with LSTMs.
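
The "gradient highway" claim can be made precise from the cell-state update. Below is the standard simplified argument, ignoring the gates' own dependence on the previous hidden state; the notation is the usual LSTM formulation rather than anything specific to this article:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-k}} \approx \prod_{j=0}^{k-1} \operatorname{diag}(f_{t-j})
```

If the forget gate learns values close to 1 on the dimensions that matter, this product stays close to the identity and the error signal survives across many steps. A vanilla RNN instead multiplies by the same recurrent weight matrix (and an activation derivative) at every step, which shrinks or grows exponentially.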

LSTM to Transformer to SSM

LSTMs process tokens sequentially (training can't be parallelized across time steps) and compress all history into a fixed-size hidden state (an information bottleneck). Transformers solve both problems: training is parallel across positions, and attention can look directly at any earlier token. The price is attention cost that grows quadratically with sequence length. SSMs like Mamba revisit the gated-state idea: they maintain a compressed state (like LSTM) but make the gates input-dependent (selective) and hardware-efficient, aiming for LSTM's constant-memory advantage with Transformer-level quality.
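
To make the trade-off concrete, here is a toy sketch of what each family keeps in memory while processing a sequence token by token. The update rules below are stand-ins chosen for illustration, not real models:

```python
import numpy as np

d, n = 16, 1000                                  # state width, sequence length
xs = np.random.default_rng(0).normal(size=(n, d))

# LSTM / SSM style: one fixed-size state, updated strictly in order.
state = np.zeros(d)
for x in xs:
    state = np.tanh(0.5 * state + 0.5 * x)       # stand-in for a gated update; memory stays O(d)

# Transformer style: a cache of every past token, consulted at each step.
cache = []
for x in xs:
    cache.append(x)                              # memory grows to O(n * d)
    scores = np.stack(cache) @ x                 # attend directly to any earlier token: O(n) work per step
```

The recurrent loop never holds more than d numbers of state but cannot look back at a specific token; the attention loop can, at the cost of storing and scanning everything it has seen so far.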
