Basics

SwiGLU

Gated Linear Unit, GLU Variants
A gated activation function used in the feed-forward layers of modern Transformers. SwiGLU combines the SiLU/Swish activation with a gating mechanism: SwiGLU(x) = SiLU(W1 · x) ⊗ (W3 · x), where ⊗ is element-wise multiplication. This lets the network learn what information to pass through, and it consistently outperforms standard ReLU or GELU feed-forward layers.

Why It Matters

SwiGLU is the feed-forward activation used by LLaMA, Mistral, Qwen, Gemma, and most modern LLMs. Understanding it helps you read model architectures and explains why modern FFN layers have three weight matrices instead of two. It is a small architectural choice with a disproportionate impact on model quality.

Deep Dive

Standard FFN: FFN(x) = W2 · GELU(W1 · x). Two weight matrices, one activation. SwiGLU FFN: FFN(x) = W2 · (SiLU(W1 · x) ⊗ (W3 · x)). Three weight matrices and a gating mechanism. The gate (W3 · x) controls what passes through, letting the network selectively suppress or amplify different features. To keep the parameter count constant, the intermediate dimension is typically reduced from 4×model_dim to (8/3)×model_dim.
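A minimal PyTorch sketch of such a block, following the formulas above. The class name, bias-free projections, and example dimensions are illustrative assumptions, not any specific model's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Sketch of a SwiGLU-style feed-forward block."""

    def __init__(self, d_model: int):
        super().__init__()
        # Shrink the hidden width from 4*d_model to ~(8/3)*d_model so the
        # three matrices cost about the same as a standard two-matrix FFN:
        # 3 * d * (8/3)d = 8d^2  ==  2 * d * 4d.
        d_hidden = int(8 * d_model / 3)
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # up projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = W2 · (SiLU(W1 · x) ⊗ (W3 · x))
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFFN(512)
y = ffn(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

Production implementations usually also round the hidden dimension up to a hardware-friendly multiple (LLaMA, for example, rounds up to a multiple of 256).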

Why Gating Helps

Gating gives the network a multiplicative interaction that standard activations lack. Standard activations apply a fixed non-linearity. Gating applies a learned, input-dependent non-linearity. This additional expressiveness helps the network learn more complex functions per layer, which means you need fewer layers (or smaller layers) for equivalent performance. Shazeer (2020) showed that GLU variants consistently outperform standard FFN across model sizes.
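A toy illustration of that difference; the gate vector here is hand-picked rather than computed from real weights, purely to show the effect:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 2.0, 2.0])  # three identical pre-activations

# A fixed activation maps equal inputs to equal outputs, always:
print(F.gelu(z))  # same value three times

# A gate is a second, learned linear view of the input, so the same
# pre-activation can be suppressed, passed, or amplified depending on
# what the rest of the input looks like:
gate = torch.tensor([0.0, 1.0, 3.0])  # stands in for W3 @ x
print(F.silu(z) * gate)               # ~0.0, ~1.76, ~5.28
```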

The GLU Family

SwiGLU is one of several GLU variants: GeGLU (uses GELU instead of SiLU), ReGLU (uses ReLU), and the original GLU (uses sigmoid). SwiGLU and GeGLU perform similarly and both outperform ReGLU. The choice between them is mostly empirical — SwiGLU has become the default through convention (LLaMA adopted it, others followed) rather than clear theoretical superiority over GeGLU.
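The variants share one template and differ only in the activation applied to the first projection; the gate path stays linear. A sketch, where `glu_variant` and the dimensions are hypothetical names for illustration:

```python
import torch
import torch.nn.functional as F

def glu_variant(x, W_act, W_gate, act):
    # act(W_act x) ⊗ (W_gate x): pick the activation, keep the gate linear.
    return act(W_act @ x) * (W_gate @ x)

d = 8
x = torch.randn(d)
W_act, W_gate = torch.randn(d, d), torch.randn(d, d)

swiglu = glu_variant(x, W_act, W_gate, F.silu)         # SwiGLU
geglu  = glu_variant(x, W_act, W_gate, F.gelu)         # GeGLU
reglu  = glu_variant(x, W_act, W_gate, F.relu)         # ReGLU
glu    = glu_variant(x, W_act, W_gate, torch.sigmoid)  # original GLU
```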
