Basics

Activation Function

ReLU, GELU, SiLU, Swish
A mathematical function applied to a neuron's output to introduce non-linearity into the network. Without activation functions, a neural network, no matter how deep, can only learn linear relationships. ReLU, GELU, and SiLU/Swish are the most common in modern architectures.

Why It Matters

Activation functions are why deep learning works. A stack of linear transformations is still just one big linear transformation. Activation functions between layers let the network learn complex, non-linear patterns: the curves, edges, and subtle relationships that make neural networks powerful.
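A quick way to see this: two stacked linear layers with no activation between them collapse into a single linear map, while inserting a ReLU breaks the collapse. Here is a minimal sketch in PyTorch (layer sizes are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)                 # a small batch of 4 inputs with 8 features

lin1 = nn.Linear(8, 16, bias=False)
lin2 = nn.Linear(16, 8, bias=False)

# Without an activation, the two layers reduce to one linear map whose
# weight is the product of the two weight matrices.
stacked = lin2(lin1(x))
collapsed = x @ (lin2.weight @ lin1.weight).T
print(torch.allclose(stacked, collapsed, atol=1e-6))    # True

# With a ReLU in between, no single linear map reproduces the output.
nonlinear = lin2(torch.relu(lin1(x)))
print(torch.allclose(nonlinear, collapsed, atol=1e-6))  # False
```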

Deep Dive

ReLU (Rectified Linear Unit) is the simplest: f(x) = max(0, x). It outputs zero for negative inputs and passes positive inputs unchanged. ReLU largely mitigated the vanishing gradient problem that plagued earlier activation functions (sigmoid, tanh) by providing a constant gradient of 1 for positive inputs. Its simplicity and effectiveness made it the default for over a decade.
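A minimal sketch of ReLU and its gradient behavior, written in plain PyTorch (the sample values are arbitrary):

```python
import torch

def relu(x: torch.Tensor) -> torch.Tensor:
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return torch.clamp(x, min=0.0)  # same as max(0, x)

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = relu(x)
y.sum().backward()

print(y)       # tensor([0.0000, 0.0000, 0.5000, 2.0000], grad_fn=...)
print(x.grad)  # tensor([0., 0., 1., 1.])  -- constant gradient of 1 for positive inputs
```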

Beyond ReLU

GELU (Gaussian Error Linear Unit) is now the standard in Transformers (used by BERT, GPT, and most LLMs). Unlike ReLU's hard cutoff at zero, GELU smoothly tapers near zero, which provides better gradient flow. SiLU/Swish (x · sigmoid(x)) is similar and used in some architectures like LLaMA. The practical differences between GELU and SiLU are small — both outperform ReLU in Transformer-scale models.
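To see the difference in shape, one can evaluate the three functions on a few points with PyTorch's built-ins (exact GELU values depend on whether the tanh approximation is used; this sketch uses the defaults):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, 7)   # [-3, -2, -1, 0, 1, 2, 3]

print(F.relu(x))   # hard cutoff: exactly zero for every negative input
print(F.gelu(x))   # smooth taper: small negative outputs just below zero
print(F.silu(x))   # x * sigmoid(x): also smooth, slightly different shape

# SiLU/Swish is literally x * sigmoid(x)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x)))   # True
```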

GLU Variants

Modern LLMs often use Gated Linear Units (GLU) and their variants (SwiGLU, GeGLU) in feed-forward layers. These multiply two parallel linear projections together, effectively letting the network gate what information passes through. SwiGLU (used in LLaMA, Mistral, and many others) combines SiLU activation with gating and consistently improves over standard feed-forward layers at the cost of slightly more parameters.
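As a rough sketch of the gating idea (the module name, dimension sizes, and the roughly 2/3-of-4x hidden-width heuristic are illustrative, not taken from any specific model's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # branch passed through SiLU
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # plain linear branch
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # project back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate: multiply the SiLU-activated projection elementwise with the linear one.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFeedForward(d_model=512, d_hidden=1376)  # hidden width often ~2/3 of 4*d_model
out = ffn(torch.randn(2, 16, 512))                   # (batch, sequence length, d_model)
print(out.shape)                                     # torch.Size([2, 16, 512])
```

The third projection is where the extra parameters come from: a standard feed-forward layer needs two weight matrices, while a GLU variant needs three.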

Related Concepts

← All Terms
Agent →