Basics

SwiGLU

Gated Linear Unit, GLU Variants
A gated activation function used in the feed-forward layers of modern Transformers. SwiGLU combines the SiLU/Swish activation with a gating mechanism: SwiGLU(x) = SiLU(x · W1) ⊗ (x · W3), where ⊗ is element-wise multiplication. This lets the network learn what information to pass through, and it consistently outperforms standard ReLU or GELU feed-forward layers.

Why It Matters

SwiGLU is the feed-forward activation used in LLaMA, Mistral, Qwen, and most other modern LLMs (Gemma uses the closely related GeGLU). Understanding it helps you read model architectures and explains why modern FFN layers have three weight matrices instead of two. It is a small architectural choice with a disproportionately large impact on model quality.

Deep Dive

Standard FFN: FFN(x) = W2 · GELU(W1 · x). Two weight matrices, one activation. SwiGLU FFN: FFN_SwiGLU(x) = W2 · (SiLU(W1 · x) ⊗ (W3 · x)). Three weight matrices and a gating mechanism. The gate (W3 · x) controls what passes through, letting the network selectively suppress or amplify different features. To keep the parameter count constant, the intermediate dimension is typically reduced from 4×model_dim to (8/3)×model_dim.
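A minimal PyTorch sketch of this three-matrix block (the SwiGLUFFN class name, the w1/w2/w3 attribute names, and the dimensions below are illustrative assumptions, loosely following a LLaMA-style layout rather than any specific library's API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal SwiGLU feed-forward sketch; real implementations usually also
    round d_ff up to a hardware-friendly multiple."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # activated projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # linear (gate) projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(W1 · x) ⊗ (W3 · x), then project back to d_model with W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

d_model = 512
d_ff = int(8 * d_model / 3)  # (8/3)×model_dim keeps parameters comparable:
                             # 3·d·(8/3)d = 8d² = 2·d·4d of the standard FFN
ffn = SwiGLUFFN(d_model, d_ff)
y = ffn(torch.randn(2, 16, d_model))  # (batch, seq, d_model) in, same shape out
```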

Why Gating Helps

Gating gives the network a multiplicative interaction that standard activations lack. Standard activations apply a fixed non-linearity. Gating applies a learned, input-dependent non-linearity. This additional expressiveness helps the network learn more complex functions per layer, which means you need fewer layers (or smaller layers) for equivalent performance. Shazeer (2020) showed that GLU variants consistently outperform standard FFN across model sizes.
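A toy sketch of that contrast (the weights, sizes, and random input are arbitrary assumptions, not from the source):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 16
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5
W3 = torch.randn(d_ff, d_model) / d_model ** 0.5
x = torch.randn(d_model)

h = W1 @ x                # pre-activation, shared by both cases
fixed = F.gelu(h)         # standard FFN: the same fixed function for every input
gate = W3 @ x             # gated FFN: a second, learned view of the same input
gated = F.silu(h) * gate  # multiplication lets the gate suppress or amplify
                          # each hidden feature depending on x itself
print(fixed[:4])
print(gated[:4])
```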

The GLU Family

SwiGLU is one of several GLU variants: GeGLU (uses GELU instead of SiLU), ReGLU (uses ReLU), and the original GLU (uses sigmoid). SwiGLU and GeGLU perform similarly and both outperform ReGLU. The choice between them is mostly empirical — SwiGLU has become the default through convention (LLaMA adopted it, others followed) rather than clear theoretical superiority over GeGLU.
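For comparison, a compact sketch of the four variants under the same W1/W3 convention as the Deep Dive formulas (the weights and shapes here are placeholders):

```python
import torch
import torch.nn.functional as F

# Each GLU variant multiplies an activated branch (W1) element-wise
# by a linear branch (W3); only the activation changes.
def glu_variant(x, W1, W3, act):
    return act(x @ W1.T) * (x @ W3.T)

x = torch.randn(4, 8)
W1, W3 = torch.randn(16, 8), torch.randn(16, 8)

glu    = glu_variant(x, W1, W3, torch.sigmoid)  # original GLU
reglu  = glu_variant(x, W1, W3, F.relu)         # ReGLU
geglu  = glu_variant(x, W1, W3, F.gelu)         # GeGLU
swiglu = glu_variant(x, W1, W3, F.silu)         # SwiGLU (SiLU = Swish with β=1)
```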
