Chain-of-thought prompting works because language models are next-token predictors, and the tokens they generate become part of their own context. When you ask a model to "think step by step," you are not activating some hidden reasoning module — you are forcing it to produce intermediate tokens that constrain and guide subsequent tokens toward a correct answer. Without those intermediate steps, the model has to make a single massive inferential leap from question to answer, and that is exactly where errors pile up. With CoT, each step narrows the probability space for the next one. It is the difference between trying to multiply 347 by 29 in your head all at once versus writing out the partial products on paper.
The original 2022 paper by Wei et al. at Google showed that CoT prompting was essentially free for large models — just adding "Let's think step by step" to a prompt boosted GSM8K math accuracy from around 18% to 57% on PaLM 540B. But the technique barely helped smaller models, which led to a practical rule of thumb: CoT is most useful on models above roughly 10 billion parameters. Below that threshold, the model often generates plausible-sounding but wrong reasoning steps, which actually hurts more than jumping straight to an answer. This is worth remembering if you are routing between models of different sizes in production.
Modern frontier models — Claude, GPT-4, Gemini — have largely internalized chain-of-thought during training. Anthropic and OpenAI both use variants of process reward models and reinforcement learning to train models that reason through problems before answering, even when you do not explicitly ask them to. OpenAI's o1 and o3 models take this furthest, performing extended internal reasoning that you can see in a "thinking" trace. Claude's extended thinking works similarly. The practical upshot is that for cutting-edge models, explicit CoT prompting matters less than it did in 2023, but it still helps when you want to inspect the reasoning, catch errors, or when you are working with smaller or open-source models that did not get that training.
A common misconception is that chain-of-thought always means longer, slower responses. In practice, you can combine CoT with structured output — ask the model to reason in a scratchpad section, then produce a concise final answer. Many API users put the reasoning in a separate field or use XML tags to delimit thinking from the answer. This gives you the accuracy benefits without forcing your end users to wade through paragraphs of reasoning. Another gotcha: CoT can actually make models worse on simple tasks where overthinking introduces unnecessary doubt. If you are asking "What is the capital of France?" you do not need five steps of reasoning — you need a direct answer.
The variants of CoT are worth knowing. Zero-shot CoT (just appending "think step by step") is the simplest. Few-shot CoT provides worked examples with reasoning chains in the prompt. Tree-of-thought goes further, letting the model explore multiple reasoning branches and backtrack. Self-consistency generates several CoT paths and takes a majority vote on the final answer, which is one of the most reliable accuracy boosters available. Each step up costs more tokens and latency, so the right choice depends on whether you are optimizing for cost, speed, or correctness — and how hard the problem actually is.