Basics

Sigmoid

Logistic Function
A mathematical function that squashes any real number into the (0, 1) interval: σ(x) = 1 / (1 + e^(−x)). Historically the default activation function in neural networks, it has largely been replaced by ReLU and GELU in hidden layers, but it is still used for binary classification outputs, gating mechanisms (in LSTMs and GLUs), and attention-like operations that need a value between 0 and 1.
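The definition above translates directly into a few lines of code; a minimal sketch using only the standard library:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5, the midpoint
print(sigmoid(4))    # ~0.982, approaching 1
print(sigmoid(-4))   # ~0.018, approaching 0
```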

Why It Matters

Sigmoid is everywhere in AI, even though it is no longer the default hidden activation. LSTM gates use sigmoid. The SiLU/Swish activation is x · sigmoid(x). Binary classifiers use sigmoid as the output activation. Understanding sigmoid, and why ReLU replaced it in hidden layers, is foundational knowledge for understanding neural network design choices.

Deep Dive

Sigmoid's shape: it's an S-curve centered at 0. For large positive inputs, it saturates near 1. For large negative inputs, it saturates near 0. Around 0, it transitions smoothly. This shape made it a natural choice for early neural networks: it mimics a biological neuron's firing rate (off to on) and naturally produces bounded outputs.
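One practical consequence of the saturating shape: the naive formula computes `exp(-x)`, which overflows for large-magnitude negative inputs. A common numerically stable formulation (a sketch, not tied to any particular library) branches so the exponent is always non-positive:

```python
import math

def stable_sigmoid(x: float) -> float:
    """Sigmoid computed so exp() never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, rewrite as e^x / (1 + e^x); e^x is tiny, never overflows.
    z = math.exp(x)
    return z / (1.0 + z)

print(stable_sigmoid(-1000.0))  # 0.0, rather than an OverflowError
```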

Why It Was Replaced

Sigmoid has two problems for deep networks. First, vanishing gradients: in the saturated regions (very positive or very negative inputs), the gradient is near zero, so learning effectively stops for those neurons. Second, non-zero-centered outputs: sigmoid always outputs positive values, which forces the gradients flowing into a layer's weights to share the same sign, slowing convergence. ReLU fixes the first problem with a constant gradient of 1 for positive inputs; its outputs are also non-negative, but in practice the healthy gradient matters far more.

Where Sigmoid Survives

Sigmoid remains the right choice when you specifically need a (0, 1) output: binary classification (probability of the positive class), gating (how much to let through, as in LSTMs), and any operation where you need a smooth, bounded activation. The SiLU activation function (x · sigmoid(x)) brings sigmoid back into modern architectures in a gating role, combining sigmoid's smoothness with the identity function's gradient properties.
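SiLU's gating role is visible directly from its definition: sigmoid scales the identity, acting as a soft gate that opens for positive inputs and closes for negative ones. A minimal sketch:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    """SiLU/Swish: x * sigmoid(x). Sigmoid acts as a smooth gate on x."""
    return x * sigmoid(x)

print(silu(5.0))    # ~4.97: gate nearly open, output close to x
print(silu(-5.0))   # ~-0.033: gate nearly closed, output close to 0
print(silu(0.0))    # 0.0: half-open gate times zero input
```

Unlike ReLU, SiLU is smooth everywhere and lets a small gradient through for negative inputs, which is part of why it appears in modern architectures.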
