Fundamentals

Logits

Raw Scores, Pre-Softmax Outputs
The raw, unnormalized scores that a model outputs before they're converted into probabilities by the softmax function. For a language model, the logits are a vector with one value per token in the vocabulary — higher values indicate tokens the model considers more likely. Logits are the most informative output a model produces, containing more information than the final probability distribution.

Why it matters

Understanding logits helps you understand how models "think." Temperature, top-p, and top-k sampling all operate on logits. Classifier-free guidance in image generation manipulates logits. Logit bias (adding offsets to specific tokens) lets you steer model behavior. If you're building AI applications beyond basic chat, you'll eventually need to work with logits directly.

Deep Dive

The model's final layer produces a vector of size V (vocabulary size, typically 32K–128K). Each element is a logit for that token. Softmax converts these to probabilities: P(token_i) = exp(logit_i) / ∑ exp(logit_j). Before softmax, the logits can be any real number — positive, negative, or zero. A logit of 10 vs. 5 means the model considers the first token about e^5 ≈ 150x more likely.
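The softmax step above can be sketched in a few lines of plain Python. The logit values here are illustrative, not from any real model; note that subtracting the maximum logit before exponentiating is the standard trick for numerical stability and does not change the result, since softmax is invariant to adding a constant to every logit.

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating: shifting every logit
    # by the same constant leaves the softmax output unchanged, and it
    # prevents math.exp from overflowing on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token "vocabulary" with hand-picked logits (illustrative values).
logits = [10.0, 5.0, 0.0, -2.0]
probs = softmax(logits)

# The probability ratio between two tokens depends only on the logit gap:
# probs[0] / probs[1] == exp(10 - 5) == e^5, roughly 148x.
```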

Logit Manipulation

Several techniques work directly on logits. Temperature divides all logits by T before softmax (T<1 sharpens the distribution, T>1 flattens it). Top-k masks all logits except the k highest to −∞, so the excluded tokens get zero probability after softmax. Top-p (nucleus sampling) likewise masks the logits of tokens outside the smallest set whose cumulative probability exceeds p. Logit bias adds a fixed offset to specific tokens' logits — adding +10 to the logit for "JSON" makes the model strongly prefer starting with JSON. Repetition penalty reduces the logits of recently generated tokens.
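These manipulations can be sketched as small transformations applied to the logit vector before softmax. This is a minimal illustration, not any particular library's API; the function names and example values are invented for clarity. Masked tokens use −∞ so that softmax assigns them exactly zero probability.

```python
import math

NEG_INF = float("-inf")

def softmax(logits):
    # Softmax that treats -inf logits as probability zero.
    m = max(x for x in logits if x != NEG_INF)
    exps = [math.exp(x - m) if x != NEG_INF else 0.0 for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, t):
    # T < 1 sharpens the distribution, T > 1 flattens it.
    return [x / t for x in logits]

def apply_top_k(logits, k):
    # Mask everything below the k-th highest logit to -inf.
    # (Ties at the cutoff may keep slightly more than k tokens.)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else NEG_INF for x in logits]

def apply_top_p(logits, p):
    # Keep the smallest set of tokens whose cumulative probability >= p,
    # taken in order of decreasing probability; mask the rest.
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return [x if i in keep else NEG_INF for i, x in enumerate(logits)]

def apply_logit_bias(logits, bias):
    # bias maps token index -> fixed offset added to that token's logit.
    return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]
```

In practice these compose: an engine might apply logit bias and repetition penalty first, then temperature, then top-k/top-p masking, and finally sample from the resulting distribution.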

Log-Probabilities

Most APIs can return log-probabilities (log of the softmax output) alongside generated tokens. These are useful for: measuring model confidence (low log-prob = uncertain), calibrating outputs (are 90%-confident predictions correct 90% of the time?), and building classifiers from LLMs (compare log-probs of different completions). Log-probs are more numerically stable than raw probabilities for extreme values.
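The two core ideas above — computing log-probs stably and comparing completions by summed log-prob — can be sketched as follows. The stable identity is log p_i = logit_i − logsumexp(logits); the candidate labels and their per-token log-probs below are invented for illustration, not real API output.

```python
import math

def log_softmax(logits):
    # Stable log-probabilities: log p_i = logit_i - logsumexp(logits).
    # Working in log space avoids the underflow that exp() of very
    # negative values would cause with raw probabilities.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_logprob(token_logprobs):
    # The log-prob of a sequence is the sum of its tokens' log-probs
    # (equivalent to multiplying probabilities, but underflow-safe).
    return sum(token_logprobs)

# Classifier-from-LLM sketch: score each candidate completion by its
# summed log-prob and pick the highest. These numbers are made up.
candidates = {
    "positive": [-0.2, -0.1],
    "negative": [-1.5, -0.9],
}
best = max(candidates, key=lambda k: sequence_logprob(candidates[k]))
```

The same summed log-probs feed calibration checks: bucket predictions by confidence (e.g. exp of the sequence log-prob) and measure how often each bucket is actually correct.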
