
RNN

Recurrent Neural Network, LSTM, GRU
A neural network that processes sequences by maintaining a hidden state that is updated at every step, so it "remembers" what it has seen so far. LSTMs and GRUs are improved variants that address the original RNN's tendency to forget long-range dependencies. RNNs dominated NLP and speech before Transformers replaced them around 2018–2020.

Why It Matters

RNNs are the ancestors of modern language models. Understanding why they failed (slow sequential processing, difficulty with long-range dependencies) explains why Transformers succeeded (parallel processing, attention over all positions). The SSM/Mamba architecture is, in some ways, a return to the RNN idea with modern fixes.

Deep Dive

An RNN processes a sequence token by token, updating its hidden state at each step: h_t = f(h_{t-1}, x_t). The hidden state is a compressed representation of everything seen so far. The problem: as sequences get longer, the hidden state must compress more and more information into a fixed-size vector, and gradient signals for early tokens vanish during backpropagation (the "vanishing gradient problem").
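Here is a minimal sketch of that recurrence in NumPy, assuming a vanilla RNN cell with a tanh activation; the weight names (W_x, W_h) and dimensions are illustrative, not taken from any particular library.

```python
# Sketch: the recurrence h_t = f(h_{t-1}, x_t) for a vanilla tanh RNN cell.
# All names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 16, 20

W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                    # fixed-size hidden state

for x_t in xs:
    # Everything seen so far is compressed into this one fixed-size vector.
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h.shape)  # (16,) no matter how long the sequence was
```

Note that h stays the same size regardless of sequence length, which is exactly where the compression pressure and vanishing gradients come from.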

LSTM and GRU

Long Short-Term Memory (LSTM, 1997) and Gated Recurrent Units (GRU, 2014) addressed vanishing gradients by introducing gates — learned mechanisms that control what information to keep, update, or forget. LSTMs have a separate cell state that can carry information unchanged across many steps, with gates controlling access. GRUs simplify LSTMs by merging the cell and hidden states while maintaining similar performance.
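As a rough illustration, one GRU step written out in NumPy, assuming the standard update-gate / reset-gate formulation (exact gate conventions vary slightly between papers and libraries); all parameter names here are illustrative.

```python
# Sketch of a single GRU step: update gate z, reset gate r, candidate h_tilde.
# Parameter names and the (1 - z) convention are illustrative assumptions.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU update; W, U, b hold per-gate parameters keyed by 'z', 'r', 'h'."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])  # update gate: how much to overwrite
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])  # reset gate: how much history to read
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])  # candidate state
    # With z near 0 the old state passes through unchanged across many steps,
    # which is the mechanism that mitigates vanishing gradients.
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(1)
d_in, d_h = 8, 16
W = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in "zrh"}
U = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}

h = gru_step(rng.normal(size=d_in), np.zeros(d_h), W, U, b)
print(h.shape)  # (16,)
```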

Why Transformers Won

RNNs process tokens sequentially — token 5 can't be processed until tokens 1–4 are done. This makes them inherently slow on parallel hardware (GPUs). Transformers process all tokens simultaneously using attention, making them dramatically faster to train. Attention also directly connects every token to every other token, solving the long-range dependency problem without relying on a compressed hidden state. The trade-off: Transformers use quadratic memory in sequence length, while RNNs use constant memory. This is why SSMs (Mamba) are interesting — they offer RNN-like efficiency with Transformer-like performance.
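A toy comparison of the two access patterns, assuming small NumPy tensors; names and dimensions are illustrative, and this ignores learned projections, masking, and multi-head details.

```python
# Sketch: sequential RNN loop vs. all-at-once attention. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
seq_len, d = 512, 64
X = rng.normal(size=(seq_len, d))  # toy token representations

# RNN-style: a sequential loop. Memory for state is O(d),
# but step t cannot start until step t-1 has finished.
W_in = rng.normal(scale=0.05, size=(d, d))
W_rec = rng.normal(scale=0.05, size=(d, d))
h = np.zeros(d)
for x_t in X:
    h = np.tanh(W_in @ x_t + W_rec @ h)

# Attention-style: every token attends to every other token in one shot.
# The score matrix is (seq_len, seq_len), i.e. quadratic in sequence length,
# but all rows can be computed in parallel on a GPU.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ X

print(h.shape, weights.shape)  # (64,) vs. (512, 512)
```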
