Basics

Softmax

Softmax Function, Normalized Exponential Function
A function that converts a vector of raw numbers (logits) into a probability distribution: every value becomes positive and they sum to 1. Softmax amplifies differences between values: the largest input gets the highest probability, and smaller inputs get exponentially smaller probabilities. It shows up in attention mechanisms, classification outputs, and next-token prediction.

Why It Matters

Softmax is everywhere in modern AI. Every time a language model predicts the next token, softmax converts the model's raw outputs into probabilities. Every attention head uses softmax to compute attention weights. Every classifier uses softmax to produce class probabilities. Understanding softmax helps you understand temperature, top-p sampling, and why models sound confident even when they are wrong.

Deep Dive

The formula: softmax(x_i) = exp(x_i) / ∑ exp(x_j). The exponential amplifies differences: if one logit is 10 and another is 5, the ratio after softmax isn't 2:1 but roughly 150:1. This winner-take-most behavior is why models tend to be confident — softmax naturally produces peaked distributions rather than uniform ones.
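A minimal sketch of that amplification in NumPy, reusing the logits 10 and 5 from the example above (the function and variable names are illustrative, not from any particular library):

    import numpy as np

    def softmax(x):
        # Plain softmax: exponentiate each logit, then normalize so the outputs sum to 1.
        e = np.exp(x)
        return e / e.sum()

    logits = np.array([10.0, 5.0])
    probs = softmax(logits)
    print(probs)                # ~[0.9933, 0.0067]
    print(probs[0] / probs[1])  # ~148.4, i.e. roughly 150:1 rather than 2:1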

Temperature and Softmax

Temperature is applied by dividing logits before softmax: softmax(x_i / T). Temperature T=1 is standard. T<1 sharpens the distribution (more confident, more deterministic). T>1 flattens it (more uniform, more random). This is exactly how the "temperature" parameter in LLM APIs works — it's a scalar applied to the logits before the final softmax that selects the next token.
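A sketch of temperature scaling under the same convention, dividing the logits by T before the softmax; the specific logits and temperature values below are illustrative, not taken from any particular API:

    import numpy as np

    def softmax(x, temperature=1.0):
        # Divide logits by T before exponentiating: T<1 sharpens, T>1 flattens.
        z = np.asarray(x) / temperature
        e = np.exp(z)
        return e / e.sum()

    logits = [2.0, 1.0, 0.5]
    print(softmax(logits, temperature=1.0))  # ~[0.63, 0.23, 0.14], the standard distribution
    print(softmax(logits, temperature=0.5))  # ~[0.84, 0.11, 0.04], sharper and more deterministic
    print(softmax(logits, temperature=2.0))  # ~[0.48, 0.29, 0.23], flatter and closer to uniform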

Numerical Stability

A practical implementation detail: computing exp(x) for large values of x causes overflow. The standard fix is to subtract the maximum value from all logits before applying softmax: softmax(x_i - max(x)). This doesn't change the output (the subtracted constant cancels in the ratio) but keeps the numbers in a manageable range. Every production softmax implementation does this, and it's the kind of detail that matters when building from scratch.
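A sketch of the max-subtraction trick in NumPy; the example relies on the fact that np.exp overflows float64 for inputs above roughly 709, so a logit of 1000 is already enough to break the naive version:

    import numpy as np

    def softmax_naive(x):
        e = np.exp(x)            # overflows to inf for large logits
        return e / e.sum()

    def softmax_stable(x):
        x = np.asarray(x)
        e = np.exp(x - x.max())  # shift by the max; the constant cancels in the ratio
        return e / e.sum()

    logits = np.array([1000.0, 999.0, 995.0])
    # softmax_naive(logits) raises an overflow warning and returns nan (inf / inf)
    print(softmax_stable(logits))  # ~[0.7275, 0.2676, 0.0049]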
