To understand temperature, you need to know what happens right before a model outputs a token. The model produces a vector of raw scores (called logits) — one score for every token in its vocabulary, which might be 32,000 to 128,000 entries. These logits are then divided by the temperature value and fed through a softmax function, which converts them into a probability distribution. When temperature is 1.0, the softmax operates on the raw logits as-is. When temperature is 0.5, the logits are effectively doubled before softmax, which makes the probability distribution sharper — the most likely token gets an even larger share of the probability. When temperature is 2.0, the logits are halved, which flattens the distribution and gives less likely tokens a better chance of being selected.
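The scaling described above is easy to see in a few lines of code. This is a minimal sketch using a toy three-token vocabulary (real vocabularies are tens of thousands of entries); the logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax.

    Lower temperature sharpens the distribution; higher flattens it.
    """
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits favoring the first token.
logits = [2.0, 1.0, 0.0]
p_default = softmax_with_temperature(logits, 1.0)  # raw logits as-is
p_sharp = softmax_with_temperature(logits, 0.5)    # logits effectively doubled
p_flat = softmax_with_temperature(logits, 2.0)     # logits halved
```

Running this, the top token's probability is highest under `p_sharp` and lowest under `p_flat`, exactly the sharpening and flattening described above.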
Temperature 0 is a special case that most API providers implement as greedy decoding — always pick the single highest-probability token, no sampling involved. This makes the output deterministic (or nearly so; some providers add tiny floating-point noise). It is the right choice when you want reproducible results: extracting structured data, classification tasks, factual Q&A, or anything where "creativity" is a liability. A common production pattern is to use temperature 0 for all automated pipelines and reserve higher temperatures for user-facing creative features.
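Greedy decoding removes sampling entirely: the temperature division never happens, and the choice reduces to an argmax over logits. A minimal sketch:

```python
def greedy_pick(logits):
    # Temperature 0 as most providers implement it: no sampling,
    # just the index of the highest-logit token, every time.
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 5.0, 1.0]
greedy_pick(logits)  # always index 1, run after run
```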
Temperature interacts with another sampling parameter called top-p (nucleus sampling) in ways that trip people up. Top-p limits the token selection to the smallest set of tokens whose cumulative probability exceeds the threshold p. Setting temperature to 0.7 with top-p at 0.9 is different from temperature 1.0 with top-p at 0.7, even though both aim for "moderate randomness." Most practitioners recommend adjusting one or the other, not both simultaneously, because the interaction is hard to reason about. Anthropic's API defaults to temperature 1.0 with top-p 1.0 for Claude. OpenAI defaults to temperature 1.0 with top-p 1.0 for GPT models. If you are tweaking both at once, you are probably overcomplicating things.
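To make the top-p mechanism concrete, here is a sketch of the nucleus-filtering step on an already-computed probability distribution (the probabilities below are invented for illustration):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize. `probs` is a full distribution
    over the vocabulary; the result maps token index -> probability.
    """
    # Walk tokens from most to least likely, accumulating probability
    # until the threshold p is reached.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize over the surviving "nucleus" before sampling.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With p=0.9, the low-probability tail token is excluded before sampling.
nucleus = top_p_filter([0.6, 0.25, 0.1, 0.05], 0.9)
```

Because temperature reshapes `probs` before this filter runs, changing temperature also changes which tokens survive the cutoff, which is exactly why the two parameters are hard to reason about together.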
The right temperature depends on the task, and the "0.7 is good for everything" advice is an oversimplification. For code generation, most developers find that 0–0.3 produces the most reliable results. For conversational assistants, 0.5–0.8 gives natural-sounding variety without going off the rails. For creative writing, brainstorming, or generating diverse options, 0.9–1.2 works well. Going above 1.5 produces increasingly incoherent output that is rarely useful in practice. Some models technically accept temperatures above 2.0, but output quality degrades fast — it starts resembling random token soup rather than creative text.
A subtle but important point: temperature affects token-level randomness, not idea-level creativity. A higher temperature does not make the model "think more creatively" in any meaningful sense — it makes it more likely to choose unexpected words. Sometimes that produces genuinely novel combinations. Other times it just produces grammatical errors, non-sequiturs, or hallucinations. If you want genuinely different approaches to a problem, you are often better off running the same prompt multiple times at moderate temperature (say 0.8) and comparing the results, rather than cranking temperature to 1.5 and hoping for the best. This is the principle behind techniques like self-consistency and best-of-N sampling, which use moderate temperature with multiple samples to get both diversity and quality.
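The multiple-samples pattern can be sketched in a few lines. Here `generate` is a hypothetical stand-in for whatever model call you use, and the vote-by-majority step is the simplest form of self-consistency (best-of-N would instead score each sample and keep the best):

```python
from collections import Counter

def self_consistency(generate, prompt, n=5, temperature=0.8):
    """Sample n completions at moderate temperature and return the
    most common answer. `generate(prompt, temperature=...)` is a
    placeholder for your actual model-calling function.
    """
    answers = [generate(prompt, temperature=temperature) for _ in range(n)]
    # Majority vote: diversity comes from sampling, quality from agreement.
    return Counter(answers).most_common(1)[0][0]
```

The design point is that randomness lives in the sampling step, while the aggregation step filters out the occasional non-sequitur that a single high-temperature run would have kept.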