Sigmoid's shape: it's an S-curve centered at 0. For large positive inputs, it saturates near 1. For large negative inputs, it saturates near 0. Around 0, it transitions smoothly. This shape made it a natural choice for early neural networks: it mimics a biological neuron's firing rate (off to on) and naturally produces bounded outputs.
Sigmoid has two problems for deep networks. First, vanishing gradients: in the saturated regions (very positive or very negative inputs), the gradient is near zero, so learning effectively stops for those neurons. Second, non-zero-centered outputs: sigmoid always outputs positive values, which forces the gradients of a layer's incoming weights to all share the same sign, slowing convergence. ReLU solves the first problem: its gradient is a constant 1 for all positive inputs, so it never saturates there. It does not solve the second: ReLU's outputs are also non-negative, but in practice the gradient benefit dominates.
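The vanishing-gradient contrast can be made concrete. Sigmoid's derivative is s(x)·(1 − s(x)), which peaks at 0.25 at x = 0 and decays toward zero in both tails, while ReLU's derivative is exactly 1 for any positive input. A sketch (function names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """d/dx sigmoid(x) = s * (1 - s); maximum of 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x: float) -> float:
    """d/dx relu(x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

At x = 10, sigmoid's gradient is about 4.5e-05: multiplied across many layers during backpropagation, such factors shrink the upstream gradient toward zero, which is exactly the vanishing-gradient problem.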
Sigmoid remains the right choice when you specifically need a (0, 1) output: binary classification (probability of the positive class), gating (how much to let through, as in LSTMs), and any operation where you need a smooth, bounded activation. The SiLU activation function (x · sigmoid(x)) brings sigmoid back into modern architectures in a gating role, combining sigmoid's smoothness with the identity function's gradient properties.
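SiLU's gating behavior follows directly from its definition: for large positive x, sigmoid(x) ≈ 1 and SiLU passes x through almost unchanged; for large negative x, sigmoid(x) ≈ 0 and the output is squashed toward zero. A minimal sketch (function name is illustrative):

```python
import math

def silu(x: float) -> float:
    """SiLU (also called swish): x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(silu(10.0))   # ~10 (sigmoid gate ~1: acts like the identity)
print(silu(0.0))    # 0
print(silu(-10.0))  # ~0 (sigmoid gate ~0: input is suppressed)
```

Unlike ReLU, SiLU is smooth everywhere and dips slightly below zero for moderately negative inputs before flattening out, so it keeps a nonzero gradient in a region where ReLU's gradient is exactly zero.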