Basics

Residual Connection

Also known as: Skip Connection, Shortcut Connection
A connection that bypasses one or more layers by adding the input directly to the output: output = layer(x) + x. Instead of learning the full transformation, each layer only needs to learn the "residual", its difference from the identity function. Residual connections sit inside every Transformer layer and are essential for training deep networks.
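
A minimal sketch of the idea in PyTorch (the ResidualBlock wrapper and its dimensions are just for illustration):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a shape-preserving layer so that output = layer(x) + x."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        # The wrapped layer only has to learn the residual,
        # i.e. the difference from the identity function.
        return self.layer(x) + x

block = ResidualBlock(nn.Linear(16, 16))
x = torch.randn(2, 16)
print(block(x).shape)  # torch.Size([2, 16])

Note that the wrapped layer must preserve the input's shape, otherwise the addition is undefined; ResNet handles shape changes with a projection shortcut.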

Why It Matters

Without residual connections, deep networks are nearly impossible to train: gradients vanish or explode as they pass through many layers. Residual connections provide a gradient highway, letting information (and gradients) flow directly from early layers to later ones, bypassing any number of intermediate transformations. This is why we can train networks with 100+ layers.
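
A small experiment (assuming PyTorch; the depth, width, and weight scale are arbitrary choices for the demo) that makes the gradient-highway claim concrete by comparing the gradient norm that reaches the input with and without skip connections:

import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = [nn.Linear(dim, dim) for _ in range(depth)]
for layer in layers:
    nn.init.normal_(layer.weight, std=0.05)  # deliberately small weights
    nn.init.zeros_(layer.bias)

def input_grad_norm(use_residual):
    x = torch.randn(dim, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out
    h.sum().backward()
    return x.grad.norm().item()

print("without residuals:", input_grad_norm(False))  # effectively zero: the gradient vanished
print("with residuals:   ", input_grad_norm(True))   # stays at a usable scale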

Deep Dive

Introduced in ResNet (He et al., 2015), residual connections solved the "degradation problem": deeper plain networks performed worse than shallower ones, not because of overfitting but because optimization became harder. The insight: it is easier to learn f(x) = 0 (the residual is zero, so the input simply passes through) than to learn f(x) = x (reproduce the input perfectly). Residual connections make the identity function the default, and each layer only needs to learn useful modifications on top of it.

In Transformers

Every Transformer layer applies two residual connections: one around the attention sub-layer (x + attention(x)) and one around the feedforward sub-layer (x + ffn(x)). This means the input to layer 1 has a direct additive path to the output of the final layer (layer 32, say): it is never overwritten, only added to at every step. This "residual stream" is a central concept in mechanistic interpretability: each layer reads from and writes to this shared stream, and the final output is the sum of all layers' contributions.
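
A simplified sketch of one such layer (assuming PyTorch; this uses the original post-norm ordering, omits dropout and masking, and the dimensions are illustrative):

import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection 1: the attention output is added back to its input.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Residual connection 2: the feedforward output is added back as well.
        x = self.norm2(x + self.ffn(x))
        return x

layer = TransformerLayer()
tokens = torch.randn(1, 10, 512)  # (batch, sequence, d_model)
print(layer(tokens).shape)        # torch.Size([1, 10, 512])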

The Residual Stream View

Thinking of a Transformer as a residual stream with layers that read from and write to it (rather than as a sequential pipeline) changes how you understand the architecture. Attention layers move information between positions in the stream. FFN layers transform information at each position. The final output is the original input plus all the modifications from all layers. This view explains why individual layers can often be removed with only limited impact: the residual stream preserves information even when a layer is skipped.
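
A toy sketch of this view (assuming PyTorch; the linear layers stand in for whatever a real layer writes to the stream). Because updates are purely additive, the final state decomposes exactly into the input plus each layer's contribution:

import torch
import torch.nn as nn

def run_residual_stream(x, layers):
    stream = x
    contributions = []
    for layer in layers:
        delta = layer(stream)    # what this layer "writes" to the stream
        contributions.append(delta)
        stream = stream + delta  # additive update: nothing is erased
    return stream, contributions

layers = [nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(2, 16)
stream, contributions = run_residual_stream(x, layers)

# The output is the original input plus the sum of all per-layer contributions.
print(torch.allclose(stream, x + sum(contributions), atol=1e-5))  # True

Dropping one layer simply removes one term from this sum, which is the intuition behind the limited impact of skipping individual layers.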
