Fundamentals

Autoregressive

Autoregressive Model, Next-Token Prediction
A model that generates output one token at a time, where each new token is predicted from all the tokens that came before it. Every modern LLM (Claude, GPT, Llama, Gemini) is autoregressive. The model does not "plan" a complete answer and then write it out; it literally predicts the next word, appends it, then predicts the next one, over and over until it decides to stop.
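The predict-append-repeat loop can be sketched in a few lines. This is a toy illustration, not a real model: the bigram table and the `next_token` helper are hypothetical stand-ins for a neural network's forward pass, chosen only to show the shape of the generation loop.

```python
import random

# Hypothetical "model": a bigram table standing in for a neural network.
# In a real LLM, next_token would be a full forward pass over the context.
BIGRAMS = {
    "the": ["cat", "dog"],
    "cat": ["sat", "ran"],
    "dog": ["ran"],
    "sat": ["<eos>"],
    "ran": ["<eos>"],
}

def next_token(context):
    """Predict one token, conditioned on everything generated so far."""
    return random.choice(BIGRAMS.get(context[-1], ["<eos>"]))

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):      # one token per iteration
        tok = next_token(tokens)     # prediction depends on ALL prior tokens
        if tok == "<eos>":           # the model decides to stop
            break
        tokens.append(tok)           # append, then predict again
    return tokens

print(generate(["the"]))
```

Note that the loop only ever appends: there is no step where an earlier token is revised, which is the property the rest of this article explores.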

Why It Matters

Understanding autoregressive generation explains most LLM behavior: why responses stream token by token, why models sometimes contradict themselves mid-paragraph, why longer outputs are slower and more expensive, and why you can't easily ask a model to go back and fix the beginning. The model always moves forward, one token at a time.

Deep Dive

Autoregressive generation sounds simple — predict the next token, repeat — but the implications run deep. The model produces a probability distribution over its entire vocabulary at each step. The token that gets selected depends on sampling parameters like temperature and top-p. At temperature 0, the model always picks the highest-probability token (greedy decoding). At higher temperatures, lower-probability tokens have a real chance of being selected, which is where creativity and variety come from.
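Temperature scaling can be shown concretely. The sketch below implements the standard recipe (divide logits by temperature, softmax, sample); the logit values are made up for illustration, and `temperature == 0` is special-cased as greedy decoding, matching the description above.

```python
import math
import random

def sample(logits, temperature=1.0):
    """Sample a token index from raw logits at a given temperature.

    temperature == 0 is greedy decoding (always the argmax);
    higher temperatures flatten the distribution, giving
    lower-probability tokens a real chance.
    """
    if temperature == 0:                        # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                             # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()                         # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]    # hypothetical scores over a 3-token vocabulary
print(sample(logits, temperature=0))    # greedy: always index 0
```

At temperature 2.0 the same logits yield a much flatter distribution, so indices 1 and 2 get sampled noticeably often; top-p would additionally truncate the sorted distribution before sampling.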

Why It's Slow

During input processing, the model can process all your prompt tokens in parallel — this is called the "prefill" phase. But during generation, each new token requires a full forward pass through the entire model, and that pass can't start until the previous token is decided. This sequential bottleneck is why output generation is much slower than input processing, and why output tokens cost more. A 1000-token response requires 1000 serial forward passes, regardless of how many GPUs you have.
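The pass-count arithmetic behind this can be made explicit. The `CountingModel` class below is a hypothetical stand-in that just counts forward passes: the prompt is absorbed in one batched prefill pass, while each generated token forces its own serial pass because its input includes the previous pass's output.

```python
class CountingModel:
    """Hypothetical stand-in for an LLM that counts forward passes."""
    def __init__(self):
        self.forward_passes = 0

    def forward(self, tokens):
        self.forward_passes += 1   # one pass, however many tokens it sees
        return "x"                 # dummy next token

model = CountingModel()

prompt = ["tok"] * 1000
model.forward(prompt)              # prefill: the whole prompt in one pass

generated = []
for _ in range(1000):              # decode: one serial pass per new token
    # each pass needs the previous pass's output, so none can overlap
    generated.append(model.forward(prompt + generated))

print(model.forward_passes)        # → 1001
```

Adding GPUs can make each individual pass faster, but it cannot collapse the 1000 decode passes into fewer steps, because step N's input does not exist until step N-1 finishes.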

The Consequences of Forward-Only

Because the model can only move forward, it can't revise earlier tokens based on later insights. If it starts a sentence with "There are three reasons:" and then realizes there are actually four, it can't go back — it has to either awkwardly squeeze in a fourth or pretend there were only three. This is why chain-of-thought prompting helps: by asking the model to think before answering, you give it a chance to work through the problem before committing to a final answer. The "thinking" tokens become scaffolding that shapes the answer tokens that follow.

Alternatives Exist

Not all generative models are autoregressive. Diffusion models (used for images) generate everything at once and iteratively refine. Some research explores non-autoregressive text generation, where the model predicts all tokens simultaneously and then iterates. But for text, autoregressive remains dominant because language has a strong left-to-right (or right-to-left) sequential structure that autoregressive models exploit naturally. The question isn't whether autoregressive will be replaced, but whether hybrid approaches can get the best of both worlds.
