
LSTM

Long Short-Term Memory
A type of recurrent neural network (RNN) designed to learn long-range dependencies in sequential data. The LSTM introduces a "cell state", a memory highway that can carry information unchanged across many time steps, controlled by three gates: an input gate (what to add), a forget gate (what to remove), and an output gate (what to expose). Invented in 1997, the LSTM dominated sequence modeling until Transformers emerged.

Why It Matters

LSTMs were the backbone of NLP for a decade (the 2010s): machine translation, speech recognition, text generation, and sentiment analysis all ran on LSTMs. Understanding the LSTM helps you understand why Transformers replaced it (parallelism and long-range attention vs. sequential processing and a compressed state) and why SSMs like Mamba are interesting (they revisit the gated-state idea with modern improvements).

Deep Dive

LSTM's three gates are all small neural networks that output values between 0 (completely block) and 1 (completely pass through). The forget gate decides which cell state information to discard. The input gate decides which new information to add. The output gate decides which cell state information to expose as the hidden state. This gating mechanism lets the network learn what to remember and what to forget over long sequences — something vanilla RNNs couldn't do.
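To make the gating concrete, here is a minimal single-step LSTM cell in NumPy. This is an illustrative sketch with made-up parameter names and sizes, not a production implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step. `p` holds weight matrices W_f/W_i/W_o/W_c of shape
    (hidden, input + hidden) and bias vectors b_f/b_i/b_o/b_c (names are illustrative)."""
    z = np.concatenate([x, h_prev])             # current input joined with previous hidden state
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate: 0 = discard, 1 = keep cell content
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate: how much new information to write
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate: how much cell state to expose
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate values to add to the cell state
    c = f * c_prev + i * c_tilde                # cell state update: gated keep + gated write
    h = o * np.tanh(c)                          # hidden state passed to the next step / layer
    return h, c

# Toy usage with hypothetical sizes: input dim 8, hidden dim 16, a 5-step sequence.
rng = np.random.default_rng(0)
D, H = 8, 16
p = {f"W_{g}": rng.normal(scale=0.1, size=(H, D + H)) for g in "fioc"}
p.update({f"b_{g}": np.zeros(H) for g in "fioc"})
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(x, h, c, p)
```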

Why LSTMs Were Revolutionary

Before LSTM, RNNs suffered from vanishing gradients: information from early in a sequence couldn't influence processing of later parts because gradients decayed exponentially through time. LSTM's cell state acts as a gradient highway — it can carry gradients unchanged through hundreds of steps. This is what enabled sequence-to-sequence learning: machine translation (encode source sentence, decode target sentence), text summarization, and question answering all became practical with LSTMs.
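The "gradient highway" can be stated precisely. Below is a quick sketch in standard LSTM notation of the cell-state update and its direct gradient path (ignoring the indirect dependence through the gates themselves):

```latex
% Cell-state update and hidden state (standard LSTM notation)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)

% The direct path from c_{t-1} to c_t is an element-wise product with the forget gate:
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)

% When the network learns f_t \approx 1, this factor stays near 1 across many steps,
% so gradients flow back without the repeated weight-matrix products and squashing
% nonlinearities that make vanilla-RNN gradients decay exponentially.
```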

LSTM to Transformer to SSM

LSTMs process tokens sequentially (training can't be parallelized across time steps) and compress all history into a fixed-size hidden state (an information bottleneck). Transformers solve both problems: parallel training and direct attention to any token. But Transformers pay for these gains with attention cost that scales quadratically with sequence length. SSMs like Mamba revisit the gated-state idea: they maintain a compressed state (like an LSTM) but make the gates input-dependent (selective) and hardware-efficient, getting the LSTM's constant-memory advantage at Transformer-level quality.
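The sketch below shows only the gist of the selective-gating idea, not Mamba's actual state-space parameterization: a gated recurrence whose gates are computed from the current input, kept linear in the state so it could also be evaluated with a parallel (associative) scan rather than a step-by-step loop. All names and sizes here are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def selective_gated_scan(xs, w_keep, w_write):
    """Toy input-dependent gated recurrence: h_t = a(x_t) * h_{t-1} + b(x_t) * x_t.
    `w_keep` / `w_write` are per-feature parameter vectors (illustrative names only).
    The state has constant size like an LSTM cell state, but there is no tanh between
    steps, so the recurrence is linear in h and admits a parallel (associative) scan."""
    h = np.zeros_like(xs[0])
    outputs = []
    for x_t in xs:
        a_t = sigmoid(w_keep * x_t)    # "keep" gate, computed from the input (selective)
        b_t = sigmoid(w_write * x_t)   # "write" gate, computed from the input
        h = a_t * h + b_t * x_t        # constant-memory state update
        outputs.append(h.copy())
    return np.stack(outputs)

# Toy usage: a sequence of 6 four-dimensional inputs.
rng = np.random.default_rng(1)
xs = rng.normal(size=(6, 4))
ys = selective_gated_scan(xs, w_keep=np.ones(4), w_write=np.ones(4))
```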

