Fundamentals

Residual Connection

Skip Connection, Shortcut Connection
A connection that skips one or more layers by adding the input directly to the output: output = layer(x) + x. Instead of each layer learning a complete transformation, it only needs to learn the "residual", the difference from the identity function. Residual connections appear in every Transformer layer and are essential for training deep networks.
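A minimal sketch of a block that follows this formula, written in PyTorch as one illustrative choice (the class name, layer widths, and inner layers are made up for the example):

```python
import torch
import torch.nn as nn

# Minimal residual block: the inner layers learn only the residual,
# and the input is added back to their output.
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.layer(x) + x   # output = layer(x) + x

x = torch.randn(4, 128)
block = ResidualBlock(128)
print(block(x).shape)  # torch.Size([4, 128])
```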

Why it matters

Without residual connections, deep networks are nearly impossible to train: gradients vanish or explode across many layers. Residual connections provide a gradient highway that lets information (and gradients) flow directly from early layers to later ones, bypassing any number of intermediate transformations. They are the reason we can train networks with 100+ layers.
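One way to see the gradient highway is to measure how much gradient reaches the first layer of a deep stack with and without skips. The PyTorch sketch below is a toy experiment with made-up sizes and helper names; exact numbers depend on initialization, but the plain stack's first-layer gradient typically shrinks sharply with depth while the residual stack's stays on the order of one:

```python
import torch
import torch.nn as nn

# Toy comparison: gradient magnitude at the first layer of a 50-layer stack,
# with and without residual connections around each layer.
def make_stack(depth, width=64):
    return nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])

def first_layer_grad_norm(layers, use_residual, width=64):
    x = torch.randn(8, width)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out   # skip connection vs. plain stack
    h.sum().backward()
    grad_norm = layers[0].weight.grad.norm().item()
    for layer in layers:                        # reset grads for the next run
        layer.zero_grad()
    return grad_norm

torch.manual_seed(0)
layers = make_stack(depth=50)
print("plain:   ", first_layer_grad_norm(layers, use_residual=False))
print("residual:", first_layer_grad_norm(layers, use_residual=True))
```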

Deep Dive

Introduced in ResNet (He et al., 2015), residual connections solved the "degradation problem": deeper networks performed worse than shallow ones, not because of overfitting but because optimization became harder. The insight: it's easier to learn f(x) = 0 (the residual is nothing, just pass the input through) than to learn f(x) = x (reproduce the input perfectly). Residual connections make the identity function the default, and each layer only needs to learn useful modifications.
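A small illustration of that identity-by-default behavior, again in PyTorch with hypothetical names: if the residual branch is initialized to zero, the block is exactly the identity at the start of training, and optimization only has to learn useful deviations from it.

```python
import torch
import torch.nn as nn

# Zero-initialized residual branch: at initialization the block is an exact
# identity mapping, so "doing nothing" is the starting point, not something
# the layer has to learn.
class ZeroInitResidual(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return x + self.proj(x)   # proj(x) == 0 at init, so output == x

x = torch.randn(2, 16)
print(torch.allclose(ZeroInitResidual(16)(x), x))  # True
```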

In Transformers

Every Transformer layer applies two residual connections: one around the attention sub-layer (x + attention(x)) and one around the feedforward sub-layer (x + ffn(x)). This means the input to layer 1 has a direct additive path to the output of layer 32 — it's added back at every step. This "residual stream" is a central concept in mechanistic interpretability: each layer reads from and writes to this shared stream, and the final output is the sum of all layers' contributions.
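A sketch of how those two residual connections sit inside one layer, in PyTorch with a pre-norm arrangement (layer norm placement varies across models; the module names and sizes here are illustrative, not taken from a specific implementation):

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        a = self.norm1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.norm2(x))                    # residual around feedforward
        return x

x = torch.randn(1, 10, 512)
print(TransformerLayer()(x).shape)  # torch.Size([1, 10, 512])
```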

The Residual Stream View

Thinking of a Transformer as a residual stream with layers that read and write to it (rather than a sequential pipeline) changes how you understand the architecture. Attention layers move information between positions in the stream. FFN layers transform information at each position. The final output is the original input plus all the modifications from all layers. This view explains why you can often remove layers with limited impact — the residual stream preserves information even when individual layers are skipped.
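The same idea in a few lines of PyTorch: if you record what each layer writes to the stream, the final state is exactly the original input plus the sum of those writes. `run_with_contributions` and the toy `layers` list are hypothetical names used only for illustration.

```python
import torch

# Residual-stream decomposition: final state = original input + the sum of
# every layer's write to the stream. `layers` stands in for the attention
# and FFN sub-layers of a real model.
def run_with_contributions(x, layers):
    stream = x
    contributions = []
    for layer in layers:
        delta = layer(stream)        # what this layer writes to the stream
        contributions.append(delta)
        stream = stream + delta
    return stream, contributions

layers = [torch.nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(3, 16)
out, deltas = run_with_contributions(x, layers)
print(torch.allclose(out, x + sum(deltas), atol=1e-6))  # True
```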

Related concepts
