Sigmoid's shape: it's an S-curve centered at 0. For large positive inputs, it saturates near 1. For large negative inputs, it saturates near 0. Around 0, it transitions smoothly. This shape made it a natural choice for early neural networks: it mimics a biological neuron's firing rate (off to on) and naturally produces bounded outputs.
Sigmoid has two problems for deep networks. First, vanishing gradients: in the saturated regions (very positive or very negative inputs), the gradient is near zero, so learning effectively stops for those neurons. Second, non-zero-centered outputs: sigmoid always outputs positive values, which forces the gradients of a layer's incoming weights to all share the same sign, slowing convergence. ReLU solves the first problem: its gradient is a constant 1 for all positive inputs, so it never saturates there. It does not solve the second: ReLU's outputs are also non-negative, but in practice the gradient benefit dominates.
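The vanishing-gradient contrast can be made concrete. Sigmoid's derivative is s(x)·(1 − s(x)), which peaks at 0.25 at x = 0 and decays toward zero in both tails, while ReLU's derivative is exactly 1 for any positive input. A sketch (function names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """d/dx sigmoid(x) = s * (1 - s); maximum of 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x: float) -> float:
    """d/dx relu(x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

At x = 10, sigmoid's gradient is about 4.5e-05: multiplied across many layers during backpropagation, such factors shrink the upstream gradient toward zero, which is exactly the vanishing-gradient problem.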
Sigmoid remains the right choice when you specifically need a (0, 1) output: binary classification (probability of the positive class), gating (how much to let through, as in LSTMs), and any operation where you need a smooth, bounded activation. The SiLU activation function (x · sigmoid(x)) brings sigmoid back into modern architectures in a gating role, combining sigmoid's smoothness with the identity function's gradient properties.
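SiLU's gating behavior follows directly from its definition: for large positive x, sigmoid(x) ≈ 1 and SiLU passes x through almost unchanged; for large negative x, sigmoid(x) ≈ 0 and the output is squashed toward zero. A minimal sketch (function name is illustrative):

```python
import math

def silu(x: float) -> float:
    """SiLU (also called swish): x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(silu(10.0))   # ~10 (sigmoid gate ~1: acts like the identity)
print(silu(0.0))    # 0
print(silu(-10.0))  # ~0 (sigmoid gate ~0: input is suppressed)
```

Unlike ReLU, SiLU is smooth everywhere and dips slightly below zero for moderately negative inputs before flattening out, so it keeps a nonzero gradient in a region where ReLU's gradient is exactly zero.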