Autoregressive generation sounds simple — predict the next token, repeat — but the implications run deep. The model produces a probability distribution over its entire vocabulary at each step. The token that gets selected depends on sampling parameters like temperature and top-p. At temperature 0, the model always picks the highest-probability token (greedy decoding). At higher temperatures, lower-probability tokens have a real chance of being selected, which is where creativity and variety come from.
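The interplay of temperature and top-p fits in a few lines. This is an illustrative NumPy sketch, not any particular library's API; the function name and defaults are assumptions:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Pick a token id from raw logits using temperature and top-p (nucleus)
    sampling. Illustrative sketch; real inference stacks do this on the GPU."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))  # greedy decoding: always the top token
    if rng is None:
        rng = np.random.default_rng()
    # Temperature rescales logits before softmax: <1 sharpens, >1 flattens.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize and sample within that nucleus.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cumulative, top_p) + 1]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

At `temperature=0` the call collapses to argmax; as `top_p` shrinks, the nucleus contracts toward the single most likely token, so the two parameters trade determinism against variety.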
During input processing, the model can process all your prompt tokens in parallel — this is called the "prefill" phase. But during generation, each new token requires a full forward pass through the entire model, and that pass can't start until the previous token is decided. This sequential bottleneck is why output generation is much slower than input processing, and why output tokens cost more. A 1000-token response requires 1000 serial forward passes, regardless of how many GPUs you have.
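The prefill/decode asymmetry can be made concrete with a toy model that does nothing but count forward passes. `ToyLM`, `prefill`, and `decode_step` are hypothetical stand-ins, not a real inference API; the only point is the pass count:

```python
class ToyLM:
    """Hypothetical stand-in for a transformer; it only counts forward passes."""
    def __init__(self):
        self.forward_passes = 0

    def prefill(self, prompt_tokens):
        # All prompt tokens go through the model in one parallel pass.
        self.forward_passes += 1
        return prompt_tokens[-1]

    def decode_step(self, token):
        # Each new token needs its own full forward pass, and that pass
        # cannot start until the previous token has been chosen.
        self.forward_passes += 1
        return token + 1  # dummy "next token"

def generate(model, prompt_tokens, n_new):
    token = model.prefill(prompt_tokens)
    output = []
    for _ in range(n_new):
        token = model.decode_step(token)
        output.append(token)
    return output

model = ToyLM()
generate(model, [101, 7, 42], 10)
# One prefill pass plus 10 serial decode passes: 11 total,
# however long the prompt is.
```

More GPUs widen each forward pass but cannot shorten the chain of decode steps, which is the sequential bottleneck the paragraph above describes.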
Because the model can only move forward, it can't revise earlier tokens based on later insights. If it starts a sentence with "There are three reasons:" and then realizes there are actually four, it can't go back — it has to either awkwardly squeeze in a fourth or pretend there were only three. This is why chain-of-thought prompting helps: by asking the model to think before answering, you give it a chance to work through the problem before committing to a final answer. The "thinking" tokens become scaffolding that shapes the answer tokens that follow.
Not all generative models are autoregressive. Diffusion models (used for images) generate everything at once and refine it iteratively. Some research explores non-autoregressive text generation, where the model predicts all tokens simultaneously and then iterates. But for text, autoregressive generation remains dominant because language has a strong left-to-right (or right-to-left) sequential structure that autoregressive models exploit naturally. The question isn't whether autoregressive generation will be replaced, but whether hybrid approaches can get the best of both worlds.