Basics

Activation Function

ReLU, GELU, SiLU, Swish
A mathematical function applied to a neuron's output to introduce non-linearity into the network. Without activation functions, a neural network, no matter how deep, can only learn linear relationships. ReLU, GELU, and SiLU/Swish are the most common in modern architectures.

Why It Matters

Activation functions are the reason deep learning works at all. A stack of linear transformations is still just one large linear transformation. Placing activation functions between layers lets the network learn complex, non-linear patterns: the curves, edges, and subtle relationships that make neural networks powerful.
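As a quick illustration of that point, here is a minimal NumPy sketch (shapes and variable names are arbitrary) showing that two linear layers with no activation in between collapse into a single equivalent linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # a batch of 4 inputs with 8 features
W1 = rng.normal(size=(8, 16))      # "layer 1" weights (illustrative shapes)
W2 = rng.normal(size=(16, 3))      # "layer 2" weights

two_layers = (x @ W1) @ W2         # two stacked linear layers, no activation
one_layer = x @ (W1 @ W2)          # one linear layer with the merged weights

print(np.allclose(two_layers, one_layer))  # True: stacking added no expressive power
```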

Deep Dive

ReLU (Rectified Linear Unit) is the simplest: f(x) = max(0, x). It outputs zero for negative inputs and passes positive inputs through unchanged. ReLU mitigated the vanishing gradient problem that plagued earlier activation functions (sigmoid, tanh) by providing a constant gradient of 1 for positive inputs. Its simplicity and effectiveness made it the default choice for over a decade.
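A minimal NumPy sketch of ReLU and its gradient behavior (the function name is illustrative):

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]

# Gradient: 0 for x < 0 and 1 for x > 0; it is undefined at exactly 0,
# where frameworks conventionally use either 0 or 1.
```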

Beyond ReLU

GELU (Gaussian Error Linear Unit) is now the standard in Transformers (used by BERT, GPT, and most LLMs). Unlike ReLU's hard cutoff at zero, GELU smoothly tapers near zero, which provides better gradient flow. SiLU/Swish (x · sigmoid(x)) is similar and used in some architectures like LLaMA. The practical differences between GELU and SiLU are small — both outperform ReLU in Transformer-scale models.
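A small NumPy sketch comparing the three (using the common tanh approximation of GELU; the exact form is x · Φ(x), where Φ is the standard normal CDF):

```python
import numpy as np

def relu(x):
    """Hard cutoff at zero."""
    return np.maximum(0.0, x)

def gelu(x):
    """GELU, tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

x = np.array([-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0])
print(relu(x))   # zero for all negative inputs
print(gelu(x))   # smooth taper: small negative inputs give slightly negative outputs
print(silu(x))   # very similar shape to GELU
```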

GLU Variants

Modern LLMs often use Gated Linear Units (GLU) and their variants (SwiGLU, GeGLU) in feed-forward layers. These multiply two parallel linear projections together, effectively letting the network gate what information passes through. SwiGLU (used in LLaMA, Mistral, and many others) combines SiLU activation with gating and consistently improves over standard feed-forward layers at the cost of slightly more parameters.
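A minimal NumPy sketch of a SwiGLU-style feed-forward block (weight names and shapes here are illustrative, not taken from any specific model):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: the SiLU-activated "gate" projection multiplies a
    parallel "up" projection elementwise, and the result is projected back down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Illustrative sizes: d_model=8, hidden=16. Real models use far larger dimensions
# and typically shrink the hidden size (e.g. by ~2/3) to offset the extra projection.
rng = np.random.default_rng(0)
d_model, hidden = 8, 16
x = rng.normal(size=(4, d_model))
w_gate = rng.normal(size=(d_model, hidden))
w_up = rng.normal(size=(d_model, hidden))
w_down = rng.normal(size=(hidden, d_model))

print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (4, 8)
```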
