
Softmax

Softmax Function, Normalized Exponentials
A function that converts a vector of raw numbers (logits) into a probability distribution — all values become positive and sum to 1. Softmax amplifies the differences between values: the largest input gets the highest probability, and smaller inputs get exponentially smaller probabilities. It appears in attention mechanisms, classification outputs, and token prediction.

Why it matters

Softmax is everywhere in modern AI. Every time a language model predicts the next token, softmax converts raw model outputs into probabilities. Every attention head uses softmax to compute attention weights. Every classifier uses softmax to produce class probabilities. Understanding softmax helps you understand temperature, top-p sampling, and why models are "confident" even when wrong.

Deep Dive

The formula: softmax(x_i) = exp(x_i) / ∑_j exp(x_j). The exponential amplifies differences: if one logit is 10 and another is 5, the ratio after softmax isn't 2:1 but roughly 150:1 (exactly e^5 ≈ 148). This winner-take-most behavior is why models tend to be confident — softmax naturally produces peaked distributions rather than uniform ones.
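A minimal sketch of the formula in numpy (the function name is illustrative, not from any particular library). It checks the claim above: for logits 10 and 5, the probability ratio is e^(10−5) ≈ 148, not 2.

```python
import numpy as np

def softmax(x):
    """Convert a vector of logits into a probability distribution."""
    e = np.exp(x)          # exponentiate each logit
    return e / e.sum()     # normalize so the values sum to 1

logits = np.array([10.0, 5.0])
probs = softmax(logits)
# probs sums to 1, and probs[0] / probs[1] == exp(10 - 5) ≈ 148.4,
# even though the raw logits differ only by a factor of 2.
```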

Temperature and Softmax

Temperature is applied by dividing logits before softmax: softmax(x_i / T). Temperature T=1 is standard. T<1 sharpens the distribution (more confident, more deterministic). T>1 flattens it (more uniform, more random). This is exactly how the "temperature" parameter in LLM APIs works — it's a scalar applied to the logits before the final softmax that selects the next token.
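A sketch of temperature scaling, again with an illustrative function name and assuming numpy. Lowering T pushes more mass onto the largest logit; raising T spreads it out.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    z = np.asarray(logits, dtype=float) / T  # T<1 sharpens, T>1 flattens
    z = z - z.max()                          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, T=0.5)  # more peaked
standard = softmax_with_temperature(logits, T=1.0)
flat = softmax_with_temperature(logits, T=2.0)   # closer to uniform
# The top token's probability shrinks as T grows:
# sharp[0] > standard[0] > flat[0]
```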

Numerical Stability

A practical implementation detail: computing exp(x) for large values of x causes overflow. The standard fix is to subtract the maximum value from all logits before applying softmax: softmax(x_i - max(x)). This doesn't change the output (the subtracted constant cancels in the ratio) but keeps the numbers in a manageable range. Every production softmax implementation does this, and it's the kind of detail that matters when building from scratch.
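The max-subtraction trick can be sketched as follows (a numpy version; real frameworks implement the same idea in their fused kernels). With logits around 1000, a naive exp() overflows to infinity, while the shifted version stays finite and still sums to 1.

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the standard max-subtraction stability fix."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()   # subtracting a constant cancels in the ratio,
    e = np.exp(shifted)     # but keeps every exponent <= 0, so exp() <= 1
    return e / e.sum()

big_logits = np.array([1000.0, 1001.0, 1002.0])
# np.exp(1000.0) overflows to inf, so a naive softmax would produce NaNs;
# stable_softmax returns a finite distribution that sums to 1.
probs = stable_softmax(big_logits)
```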
