Basics

Softmax

Softmax Function, Normalized Exponentials
A function that converts a vector of raw numbers (logits) into a probability distribution: all values become positive and sum to 1. Softmax amplifies differences between values: the largest input gets the highest probability, and smaller inputs get exponentially smaller probabilities. It shows up in attention mechanisms, classification outputs, and token prediction.
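
A minimal illustration (the logit values here are made up for demonstration) using scipy's softmax to show the conversion from logits to a probability distribution:

```python
import numpy as np
from scipy.special import softmax

# Illustrative logits (made-up values): one clearly larger than the rest
logits = np.array([3.0, 1.0, 0.2])

probs = softmax(logits)
print(probs)        # ~[0.836, 0.113, 0.051] -- the largest logit dominates
print(probs.sum())  # 1.0 -- a valid probability distribution
```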

Why It Matters

Softmax is everywhere in modern AI. Every time a language model predicts the next token, softmax converts the model's raw outputs into probabilities. Every attention head uses softmax to compute attention weights. Every classifier uses softmax to produce class probabilities. Understanding softmax helps you understand temperature, top-p sampling, and why models sound "confident" even when they are wrong.

Deep Dive

The formula: softmax(x_i) = exp(x_i) / ∑ exp(x_j). The exponential amplifies differences: if one logit is 10 and another is 5, the ratio after softmax isn't 2:1 but roughly 150:1. This winner-take-most behavior is why models tend to be confident — softmax naturally produces peaked distributions rather than uniform ones.
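
A from-scratch sketch of the formula (illustrative, not a production implementation) makes the amplification concrete: with logits 10 and 5, the probability ratio works out to exp(5) ≈ 148.

```python
import numpy as np

def softmax(x):
    """Plain softmax straight from the formula: exp(x_i) / sum_j exp(x_j)."""
    exps = np.exp(x)
    return exps / exps.sum()

probs = softmax(np.array([10.0, 5.0]))
print(probs)                # ~[0.9933, 0.0067]
print(probs[0] / probs[1])  # ~148.4 == exp(5): a 2:1 logit gap becomes ~150:1 in probability
```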

Temperature and Softmax

Temperature is applied by dividing logits before softmax: softmax(x_i / T). Temperature T=1 is standard. T<1 sharpens the distribution (more confident, more deterministic). T>1 flattens it (more uniform, more random). This is exactly how the "temperature" parameter in LLM APIs works — it's a scalar applied to the logits before the final softmax that selects the next token.
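
A short sketch of temperature scaling (the logits are illustrative): dividing by T before softmax sharpens or flattens the resulting distribution.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))  # stable variant (see Numerical Stability below)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5])  # illustrative next-token logits

for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T))
# T=0.5 -> peaked distribution (more deterministic); T=2.0 -> flatter (more random)
```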

Numerical Stability

A practical implementation detail: computing exp(x) for large values of x causes overflow. The standard fix is to subtract the maximum value from all logits before applying softmax: softmax(x_i - max(x)). This doesn't change the output (the subtracted constant cancels in the ratio) but keeps the numbers in a manageable range. Every production softmax implementation does this, and it's the kind of detail that matters when building from scratch.
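
A sketch contrasting the naive and stabilized versions: without the max subtraction, exp overflows for large logits; with it, the output is identical but the intermediate values stay in range.

```python
import numpy as np

def naive_softmax(x):
    exps = np.exp(x)               # overflows to inf for large x (e.g. x = 1000)
    return exps / exps.sum()

def stable_softmax(x):
    exps = np.exp(x - np.max(x))   # shift by max(x); the constant cancels in the ratio
    return exps / exps.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(x))    # [nan nan nan] -- exp(1000) overflows (RuntimeWarning)
print(stable_softmax(x))   # ~[0.090, 0.245, 0.665] -- well-defined probabilities
```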

Related Concepts

← All Terms
← Slop · Sparse Attention →