Basics

Sigmoid

Logistic Function
A mathematical function that squashes any real number into the (0, 1) interval: σ(x) = 1 / (1 + e^(−x)). Historically the default activation function in neural networks, it has now been largely replaced by ReLU and GELU in hidden layers, but it is still used for binary classification outputs, gating mechanisms (in LSTMs and GLUs), and attention-like operations that need values between 0 and 1.

Why It Matters

Sigmoid is everywhere in AI, even though it is no longer the default hidden-layer activation. LSTM gates use sigmoid. The SiLU/Swish activation is x · sigmoid(x). Binary classifiers use sigmoid as their output activation. Understanding sigmoid, and why ReLU replaced it in hidden layers, is foundational to understanding neural network design choices.

Deep Dive

Sigmoid's shape: it's an S-curve centered at 0. For large positive inputs, it saturates near 1. For large negative inputs, it saturates near 0. Around 0, it transitions smoothly. This shape made it a natural choice for early neural networks: it mimics a biological neuron's firing rate (off to on) and naturally produces bounded outputs.
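A minimal sketch of this in Python (NumPy assumed; the function name and sample inputs are illustrative): the values saturate near 0 on the far left, near 1 on the far right, and pass smoothly through 0.5 at x = 0.

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))
# ~[0.000045, 0.1192, 0.5, 0.8808, 0.99995]
# saturated near 0 on the left, near 1 on the right, 0.5 at the center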

Why It Was Replaced

Sigmoid has two problems for deep networks. First, vanishing gradients: in the saturated regions (very positive or very negative inputs), the gradient is near zero, meaning learning effectively stops for those neurons. Second, non-zero-centered outputs: sigmoid always outputs positive values, which pushes the gradients flowing into a layer to be all positive or all negative, slowing convergence. ReLU fixes the first problem: its gradient is a constant 1 for positive inputs, so active neurons never saturate. (Its outputs are still non-negative, so it does not fix the centering issue, but in practice the gradient behavior matters far more.)
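A short numeric sketch of the vanishing-gradient point, using the identity σ'(x) = σ(x)(1 − σ(x)) (NumPy assumed; the helper names are illustrative): the sigmoid gradient shrinks toward zero as the input moves into the saturated region, while ReLU's gradient stays at 1 for any positive input.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the logistic function: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU gradient: 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

xs = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(xs))  # ~[0.25, 0.105, 0.0066, 0.000045] -- shrinks toward zero
print(relu_grad(xs))     # [0., 1., 1., 1.]                  -- constant for x > 0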

Where Sigmoid Survives

Sigmoid remains the right choice when you specifically need a (0, 1) output: binary classification (probability of the positive class), gating (how much to let through, as in LSTMs), and any operation where you need a smooth, bounded activation. The SiLU activation function (x · sigmoid(x)) brings sigmoid back into modern architectures in a gating role, combining sigmoid's smoothness with the identity function's gradient properties.
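A minimal sketch of these surviving roles, assuming NumPy (the variable names and sample logits are hypothetical): a sigmoid output head turning a logit into a probability, a sigmoid gate deciding how much signal passes, and SiLU as x · sigmoid(x).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU / Swish: the input gates itself through a sigmoid
    return x * sigmoid(x)

# Binary classification head: a raw score (logit) becomes a probability in (0, 1).
logit = 1.3                    # hypothetical model output
p_positive = sigmoid(logit)    # ~0.786, read as P(positive class)

# Gating, as in an LSTM gate or a GLU: the sigmoid decides how much gets through.
signal = np.array([2.0, -1.0, 0.5])
gate = sigmoid(np.array([4.0, -4.0, 0.0]))  # ~[0.982, 0.018, 0.5]
print(gate * signal)                        # mostly pass, mostly block, half pass

print(silu(np.array([-3.0, 0.0, 3.0])))     # ~[-0.142, 0.0, 2.858]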

Related Concepts
