Basics

Residual Connection

Also known as: Skip Connection, Shortcut Connection
A connection that bypasses one or more layers by adding the input directly to the output: output = layer(x) + x. Instead of learning the full transformation, each layer only needs to learn the "residual", the difference from the identity function. Residual connections appear in every Transformer layer and are essential for training deep networks.
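
A minimal sketch of that definition, assuming PyTorch; the Residual wrapper class and the Linear layer inside it are illustrative choices, not from the original text.

    import torch
    import torch.nn as nn

    class Residual(nn.Module):
        """Wrap any layer so its output is added back onto its input."""
        def __init__(self, layer: nn.Module):
            super().__init__()
            self.layer = layer

        def forward(self, x):
            # The wrapped layer only has to learn the residual (the change);
            # the input x passes through unchanged on the identity path.
            return self.layer(x) + x

    block = Residual(nn.Linear(64, 64))
    out = block(torch.randn(8, 64))   # shape (8, 64), same as the input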

Why It Matters

Without residual connections, deep networks are nearly impossible to train: gradients vanish or explode across many layers. Residual connections provide a gradient highway, letting information (and gradients) flow directly from early layers to later layers, bypassing any number of intermediate transformations. This is why networks with 100+ layers can be trained.
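
A small numerical sketch of the gradient-highway claim, assuming PyTorch autograd; the scalar function and the 0.001 scale are made up for illustration. With y = f(x) + x, the derivative dy/dx equals df/dx + 1, so even when the layer's own gradient df/dx is nearly zero, the identity path still carries gradient straight through.

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    f = lambda t: 0.001 * torch.tanh(t)   # a sub-layer whose own gradient is tiny
    y = f(x) + x                          # residual connection
    y.backward()
    # df/dx = 0.001 * (1 - tanh(2)**2) is about 7e-5, but dy/dx = df/dx + 1:
    # the identity path contributes a constant 1, so the gradient cannot vanish here.
    print(x.grad)   # tensor(1.0001)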

Deep Dive

Introduced in ResNet (He et al., 2015), residual connections solved the "degradation problem": deeper networks performed worse than shallower ones, not because of overfitting but because optimization became harder. The insight: with a skip connection it is easier for a layer to learn f(x) = 0 (add nothing and just pass the input through) than it is for a plain layer to learn f(x) = x (reproduce the input perfectly). Residual connections make the identity function the default, and each layer only needs to learn useful modifications.
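
A hedged sketch of a ResNet-style basic block under that reading; the channel count, kernel size, and the omission of batch norm are simplifications of my own. If both convolutions drift toward zero weights, the block collapses to the identity by default.

    import torch
    import torch.nn as nn

    class BasicBlock(nn.Module):
        """ResNet-style block: out = relu(conv2(relu(conv1(x))) + x)."""
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            residual = self.conv2(self.relu(self.conv1(x)))
            # If conv1 and conv2 learn weights near zero, residual is near 0 and
            # the whole block behaves like the identity function.
            return self.relu(residual + x)

    y = BasicBlock(16)(torch.randn(1, 16, 32, 32))   # output keeps the input's shape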

In Transformers

Every Transformer layer applies two residual connections: one around the attention sub-layer (x + attention(x)) and one around the feedforward sub-layer (x + ffn(x)). This means the input to layer 1 has a direct additive path to the output of layer 32 — it's added back at every step. This "residual stream" is a central concept in mechanistic interpretability: each layer reads from and writes to this shared stream, and the final output is the sum of all layers' contributions.
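
A minimal sketch of one such layer, assuming PyTorch; the dimensions are placeholders, and layer norms are omitted since their placement (pre- vs post-norm) varies between models.

    import torch
    import torch.nn as nn

    class TransformerLayer(nn.Module):
        """One Transformer layer with the two residual connections described above."""
        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            x = x + self.attn(x, x, x)[0]   # residual around the attention sub-layer
            x = x + self.ffn(x)             # residual around the feedforward sub-layer
            return x

    out = TransformerLayer()(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)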

The Residual Stream View

Thinking of a Transformer as a residual stream with layers that read and write to it (rather than a sequential pipeline) changes how you understand the architecture. Attention layers move information between positions in the stream. FFN layers transform information at each position. The final output is the original input plus all the modifications from all layers. This view explains why you can often remove layers with limited impact — the residual stream preserves information even when individual layers are skipped.
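
A hedged illustration of the residual-stream accounting, with a stand-in layer of my own (attention omitted for brevity): record what each layer adds to the stream and confirm the final state equals the input plus the sum of all writes.

    import torch
    import torch.nn as nn

    class TinyResidualLayer(nn.Module):
        """Stand-in for a Transformer layer: a single residual sub-layer."""
        def __init__(self, d_model: int = 512):
            super().__init__()
            self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                     nn.Linear(d_model, d_model))

        def forward(self, x):
            return x + self.ffn(x)   # write into the shared residual stream

    layers = [TinyResidualLayer() for _ in range(4)]
    x = torch.randn(2, 10, 512)

    stream, writes = x, []
    for layer in layers:
        out = layer(stream)
        writes.append(out - stream)   # this layer's additive contribution
        stream = out

    # The final state is the original input plus every layer's write.
    assert torch.allclose(stream, x + sum(writes), atol=1e-5)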
