LSTM

Long Short-Term Memory
A type of recurrent neural network (RNN) designed to learn long-range dependencies in sequential data. The LSTM introduces a cell state, a memory highway that can carry information unchanged across many time steps, controlled by three gates: an input gate (what to add), a forget gate (what to remove), and an output gate (what to expose). Invented in 1997, the LSTM dominated sequence modeling until Transformers emerged.

Why It Matters

The LSTM was the backbone of NLP for a decade (the 2010s): machine translation, speech recognition, text generation, and sentiment analysis all ran on LSTMs. Understanding the LSTM helps explain why Transformers replaced it (parallelism and long-range attention vs. sequential processing and a compressed state) and why SSMs like Mamba are interesting (they revisit the gated-state idea with modern improvements).

Deep Dive

The LSTM's three gates are small learned layers whose outputs lie between 0 (completely block) and 1 (completely pass through). The forget gate decides which cell state information to discard. The input gate decides which new information to add. The output gate decides which part of the cell state to expose as the hidden state. This gating mechanism lets the network learn what to remember and what to forget over long sequences — something vanilla RNNs couldn't do.
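The mechanics fit in a few lines. Below is a minimal NumPy sketch of one LSTM time step, assuming a single weight matrix W that maps the concatenated [previous hidden state; input] to the four gate pre-activations; the variable names are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (hidden + input, 4 * hidden)."""
    z = np.concatenate([h_prev, x_t]) @ W + b   # all four gate pre-activations at once
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])        # forget gate: what to discard from the cell state
    i = sigmoid(z[1*H:2*H])        # input gate: what new information to add
    g = np.tanh(z[2*H:3*H])        # candidate values to write into the cell state
    o = sigmoid(z[3*H:4*H])        # output gate: what part of the cell state to expose
    c_t = f * c_prev + i * g       # gated cell-state update (the memory highway)
    h_t = o * np.tanh(c_t)         # hidden state passed to the rest of the network
    return h_t, c_t
```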

Why LSTMs Were Revolutionary

Before LSTM, RNNs suffered from vanishing gradients: information from early in a sequence couldn't influence processing of later parts because gradients decayed exponentially through time. LSTM's cell state acts as a gradient highway — it can carry gradients unchanged through hundreds of steps. This is what enabled sequence-to-sequence learning: machine translation (encode source sentence, decode target sentence), text summarization, and question answering all became practical with LSTMs.
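A rough numeric illustration of the difference, under the simplifying assumption of scalar per-step gradient factors: a vanilla RNN's gradient through time shrinks as a product of factors below 1, while the LSTM cell state's gradient is scaled only by the forget gate, which the network can learn to keep close to 1.

```python
steps = 200

# Vanilla RNN: the gradient through time is a product of per-step Jacobians;
# with a typical contraction factor of ~0.9 it decays exponentially.
rnn_gradient = 0.9 ** steps        # ~7e-10 after 200 steps: effectively gone

# LSTM: dc_t/dc_{t-1} = f_t (the forget gate). If the network learns to keep
# f_t near 1, the gradient survives essentially unchanged across the sequence.
lstm_gradient = 0.999 ** steps     # ~0.82 after 200 steps: still usable

print(rnn_gradient, lstm_gradient)
```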

LSTM to Transformer to SSM

LSTMs process tokens sequentially (can't parallelize during training) and compress all history into a fixed-size hidden state (information bottleneck). Transformers solve both: parallel training and direct attention to any token. But Transformers trade these gains for quadratic memory cost in sequence length. SSMs like Mamba revisit the gated-state idea: they maintain a compressed state (like LSTM) but make the gates input-dependent (selective) and hardware-efficient, getting LSTM's constant-memory advantage with Transformer-level quality.
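A back-of-envelope comparison, with illustrative sizes, of why the memory profiles differ: the LSTM state is fixed no matter how long the sequence gets, while full self-attention materializes a score matrix that grows quadratically with sequence length.

```python
def lstm_state_floats(hidden_size: int) -> int:
    # An LSTM carries a fixed-size state (h_t and c_t) regardless of sequence length.
    return 2 * hidden_size

def attention_matrix_floats(seq_len: int, n_heads: int) -> int:
    # Full self-attention builds a (seq_len x seq_len) score matrix per head,
    # so memory grows quadratically with sequence length.
    return n_heads * seq_len * seq_len

print(lstm_state_floats(1024))              # 2,048 floats, independent of length
print(attention_matrix_floats(8192, 16))    # ~1.07e9 floats at 8K tokens
```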
