Basics

Feedforward Network

FFN, MLP Block
The component in each Transformer layer that processes every token independently through two linear transformations with an activation function in between. Attention mixes information across tokens (deciding which tokens are relevant), while the feedforward network processes each token's representation on its own, applying encoded knowledge and performing the non-linear transformations that carry out computation.

Why It Matters

Feedforward networks are where most of a Transformer's knowledge is stored. Attention gets all the glory, but the FFN layers contain the majority of the model's parameters (typically about two-thirds of the total) and are the primary home of factual associations, linguistic patterns, and learned computations. Understanding this helps explain phenomena such as knowledge editing and model pruning.

Deep Dive

The standard FFN: FFN(x) = W2 · activation(W1 · x + b1) + b2, where W1 projects from the model dimension to a larger intermediate dimension (typically 4x), the activation function introduces non-linearity, and W2 projects back to the model dimension. Each position (token) passes through this independently: the FFN doesn't see other tokens; only the attention layer does.
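As a concrete reference, here is a minimal PyTorch sketch of that standard block. The dimensions and the choice of GELU are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to the intermediate dim, apply a non-linearity, project back."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):  # d_ff is typically 4 * d_model
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1: d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # W2: d_ff -> d_model
        self.act = nn.GELU()                 # activation; the original Transformer used ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); every token position goes through the same weights
        # independently -- there is no mixing across the sequence dimension here.
        return self.w2(self.act(self.w1(x)))

# Example: a batch of 2 sequences, 10 tokens each
x = torch.randn(2, 10, 512)
print(FeedForward()(x).shape)  # torch.Size([2, 10, 512])
```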

SwiGLU and Gated Variants

Modern LLMs (LLaMA, Mistral, etc.) use SwiGLU instead of the standard FFN: SwiGLU(x) = SiLU(W1 · x) ⊗ (W3 · x), with the result projected back down by W2 as before. This adds a third weight matrix (W3) and a gating mechanism that lets the network control what information passes through. Despite the extra parameters, it performs better at equivalent compute, so the intermediate dimension is reduced (e.g. from 4x to roughly 8/3 x the model dimension) to keep the parameter count comparable. This is a case where a slightly more complex component improves the whole system.
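A minimal sketch of the gated variant, in the same style as above. The exact intermediate size and the bias-free projections follow common LLaMA-style conventions, but specifics vary by model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN: SiLU(W1 x) elementwise-multiplied by W3 x, then projected down by W2."""
    def __init__(self, d_model: int = 512, d_ff: int = 1365):
        # d_ff is often shrunk to ~(8/3) * d_model so the three matrices cost roughly
        # the same as two matrices at 4 * d_model.
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # "gate" projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # "up" projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # "down" projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))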

Knowledge Storage

Research suggests that FFN layers function like key-value memories: the first linear layer (W1) detects patterns in the input (keys), and the second linear layer (W2) maps those patterns to output updates (values). "The Eiffel Tower is in" activates specific neurons in W1, which through W2 promote the token "Paris." This key-value interpretation explains why FFN layers store factual knowledge and why knowledge editing techniques can modify specific facts by updating specific FFN weights.
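To make the key-value reading concrete, here is an illustrative sketch. Random weights stand in for a trained layer; the pattern-match scores and output columns play the roles of the "keys" and "values" described above:

```python
import torch
import torch.nn.functional as F

# Reading an FFN as a key-value memory (illustrative only; weights are random here).
# keys   = rows of W1 (one pattern detector per hidden neuron)
# values = columns of W2 (the update each neuron writes back to the residual stream)
d_model, d_ff = 512, 2048
W1 = torch.randn(d_ff, d_model)   # in a real model, take these from a trained layer
W2 = torch.randn(d_model, d_ff)

x = torch.randn(d_model)          # one token's hidden state

scores = F.gelu(W1 @ x)           # how strongly each "key" (neuron) matches the input
output = W2 @ scores              # weighted sum of "value" vectors: sum_i scores[i] * W2[:, i]

# The neurons with the largest scores contribute most; inspecting their value vectors
# (columns of W2) is how interpretability work traces which tokens a neuron promotes.
top = scores.topk(5).indices
print(top, scores[top])
```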

Related Concepts

← All Terms
← Federated Learning | Few-Shot Learning →