The formula: softmax(x_i) = exp(x_i) / ∑_j exp(x_j). The exponential amplifies differences: if one logit is 10 and another is 5, the ratio after softmax isn't 2:1 but roughly 150:1. This winner-take-most behavior is why models tend to be confident — softmax naturally produces peaked distributions rather than uniform ones.
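To see that amplification concretely, here is a minimal plain-Python sketch of softmax (the function name and the two logit values are just for illustration):

```python
import math

def softmax(logits):
    # Exponentiate each logit, then normalize so the outputs sum to 1.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([10.0, 5.0])
# The probability ratio is exp(10) / exp(5) = e^5, about 148 --
# far from the 2:1 ratio of the raw logits.
ratio = probs[0] / probs[1]
```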
Temperature is applied by dividing logits before softmax: softmax(x_i / T). Temperature T=1 is standard. T<1 sharpens the distribution (more confident, more deterministic). T>1 flattens it (more uniform, more random). This is exactly how the "temperature" parameter in LLM APIs works — it's a scalar applied to the logits before the final softmax that selects the next token.
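A sketch of how temperature reshapes a distribution, using the same two hypothetical logits (plain Python; the function name is an assumption, not any particular API):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # Divide each logit by T, then apply the standard softmax.
    scaled = [x / T for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 5.0]
sharp = softmax_with_temperature(logits, T=0.5)     # more peaked
standard = softmax_with_temperature(logits, T=1.0)  # unchanged logits
flat = softmax_with_temperature(logits, T=2.0)      # closer to uniform
```

Lower T pushes probability mass onto the top logit; higher T spreads it out, which is why high-temperature sampling feels more random.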
A practical implementation detail: computing exp(x) for large values of x causes overflow. The standard fix is to subtract the maximum value from all logits before applying softmax: softmax(x_i - max(x)). This doesn't change the output (the subtracted constant cancels in the ratio) but keeps the numbers in a manageable range. Every production softmax implementation does this, and it's the kind of detail that matters when building from scratch.
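The trick can be sketched in a few lines (plain Python; the example logits are chosen so that a naive `math.exp(1000.0)` would raise `OverflowError`):

```python
import math

def stable_softmax(logits):
    # Subtract the max logit before exponentiating. The constant cancels
    # in the ratio, so the output is unchanged, but the largest argument
    # to exp() is now 0, so overflow is impossible.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) overflows a double; the stable version handles it fine.
probs = stable_softmax([1000.0, 995.0])
```

Note that the result matches what the logits [5, 0] would give: only differences between logits matter, which is exactly why the subtraction is safe.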