
Autoregressive

Autoregressive Model, Next-Token Prediction
A model that generates output one token at a time, where each new token is predicted from all the tokens that came before it. Every modern LLM (Claude, GPT, Llama, Gemini) is autoregressive. The model doesn't "plan" a full response and then write it; it literally predicts the next token, appends it, then predicts the next, over and over until it emits a stop token.
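The predict-append-repeat loop can be sketched in a few lines of Python. Everything here is a stand-in: a hard-coded transition table plays the role of the model, and for simplicity it looks only at the most recent token, whereas a real LLM conditions on the entire sequence generated so far.

```python
# Toy sketch of the autoregressive loop. TRANSITIONS is a hypothetical
# stand-in for the model: it maps the last token to the "predicted"
# next token. A real LLM would look at *all* previous tokens.
TRANSITIONS = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<stop>",
}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # Predict the next token from what has been generated so far
        next_token = TRANSITIONS[tokens[-1]]
        if next_token == "<stop>":
            break  # the model "decides to stop" by emitting a stop token
        tokens.append(next_token)  # append, then predict again
    return tokens[1:]

print(generate())  # ['the', 'cat', 'sat']
```

The loop structure is the whole point: each iteration depends on the output of the previous one, which is why generation cannot be parallelized across output positions.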

Why it matters

Understanding autoregressive generation explains most LLM behaviors: why responses stream token by token, why models sometimes contradict themselves mid-paragraph, why longer outputs are slower and more expensive, and why you can't easily ask a model to "go back and fix the beginning." The model is always moving forward, one token at a time.

Deep Dive

Autoregressive generation sounds simple — predict the next token, repeat — but the implications run deep. The model produces a probability distribution over its entire vocabulary at each step. The token that gets selected depends on sampling parameters like temperature and top-p. At temperature 0, the model always picks the highest-probability token (greedy decoding). At higher temperatures, lower-probability tokens have a real chance of being selected, which is where creativity and variety come from.
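The effect of temperature can be made concrete with a small sketch. The `sample` helper below is hypothetical (not any library's API): it applies a temperature-scaled softmax to raw logits and draws a token index. Top-p, not shown here, would additionally truncate the distribution to the smallest set of tokens whose cumulative probability exceeds p before sampling.

```python
import math
import random

def sample(logits, temperature=1.0):
    """Pick a token index from raw logits (hypothetical helper)."""
    if temperature == 0:
        # Greedy decoding: always the highest-probability token
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature: higher T flattens the distribution,
    # giving lower-probability tokens a real chance of selection
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
print(sample(logits, temperature=0))    # always 0 (greedy)
print(sample(logits, temperature=1.5))  # 0, 1, or 2, at random
```

At temperature 0 the output is deterministic; raising the temperature makes the second and third tokens progressively more likely to be picked.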

Why It's Slow

During input processing, the model can process all your prompt tokens in parallel — this is called the "prefill" phase. But during generation, each new token requires a full forward pass through the entire model, and that pass can't start until the previous token is decided. This sequential bottleneck is why output generation is much slower than input processing, and why output tokens cost more. A 1000-token response requires 1000 serial forward passes, regardless of how many GPUs you have.
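The cost asymmetry above can be illustrated by counting forward passes. The `forward` function here is a hypothetical stand-in that just increments a counter: in a real model, prefill computes all prompt positions inside one parallel pass, while each output token needs its own serial pass.

```python
passes = 0

def forward(tokens):
    # Stand-in for one full forward pass through the model.
    # Real models process every position in the input in parallel
    # within a single pass, so the whole prompt costs one call.
    global passes
    passes += 1
    return len(tokens) % 100  # dummy "next token" id

prompt = list(range(1000))  # a 1000-token prompt
tokens = list(prompt)

forward(tokens)              # prefill: one parallel pass over the prompt
for _ in range(1000):        # decode: one serial pass per output token
    tokens.append(forward(tokens))

print(passes)  # 1001: one prefill pass + 1000 serial decode passes
```

No amount of extra hardware collapses the decode loop, because pass N+1 cannot start until pass N has produced its token.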

The Consequences of Forward-Only

Because the model can only move forward, it can't revise earlier tokens based on later insights. If it starts a sentence with "There are three reasons:" and then realizes there are actually four, it can't go back — it has to either awkwardly squeeze in a fourth or pretend there were only three. This is why chain-of-thought prompting helps: by asking the model to think before answering, you give it a chance to work through the problem before committing to a final answer. The "thinking" tokens become scaffolding that shapes the answer tokens that follow.

Alternatives Exist

Not all generative models are autoregressive. Diffusion models (used for images) generate everything at once and iteratively refine. Some research explores non-autoregressive text generation, where the model predicts all tokens simultaneously and then iterates. But for text, autoregressive generation remains dominant because language has a strong left-to-right (or right-to-left) sequential structure that these models exploit naturally. The open question is less whether autoregressive generation will be replaced than whether hybrid approaches can capture the best of both worlds.
