Basics

Test-Time Compute

Inference-Time Compute, Chain of Thought, Thinking Tokens
Using extra computation at inference time (while the model generates its response) to improve answer quality. Instead of producing an answer immediately, the model "thinks" for longer: it generates reasoning tokens, explores multiple approaches, or verifies its own output. More compute at test time yields better answers, especially on complex reasoning tasks.

Why It Matters

Test-time compute is the newest scaling paradigm. The first era scaled training compute (bigger models, more data). The current era also scales inference compute (more thinking per problem). Models like o1 and Claude with extended thinking show that letting a model reason for 30 seconds often beats a model that answers in 2 seconds, even when the faster model is technically larger. This changes the economics: quality becomes a function of how much you are willing to spend per query.

Deep Dive

The simplest form of test-time compute is chain-of-thought: the model generates reasoning steps before the final answer. More sophisticated approaches include: tree-of-thought (exploring multiple reasoning paths and selecting the best), self-consistency (generating multiple answers and voting), and iterative refinement (the model critiques and revises its own output). Each approach uses more tokens (= more compute = more cost) but produces better results.
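As a minimal sketch of self-consistency: sample several independent reasoning paths and take the majority answer. The `generate` and `extract_answer` helpers below are hypothetical placeholders, not a real library API; wire them to whatever model client you use.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: return one sampled completion from your model (hypothetical helper)."""
    raise NotImplementedError("connect this to your model API")

def extract_answer(completion: str) -> str:
    """Placeholder: parse the final answer out of a chain-of-thought completion."""
    return completion.strip().splitlines()[-1]

def self_consistency(prompt: str, n_samples: int = 10) -> str:
    """Sample n reasoning paths, then return the majority-vote answer.
    More samples = more test-time compute = (typically) higher accuracy."""
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The knob here is `n_samples`: each extra sample buys accuracy at the cost of proportionally more tokens, which is the quality-cost trade-off the rest of this entry describes.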

Extended Thinking

Models like o1 (OpenAI) and Claude with extended thinking generate internal reasoning tokens that the user doesn't see. These "thinking tokens" let the model decompose complex problems, check its work, consider edge cases, and revise its approach — all before producing the visible response. The cost is higher (you pay for thinking tokens) and latency is longer, but accuracy on math, coding, and reasoning tasks improves dramatically.
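As a concrete example, Anthropic's Messages API exposes extended thinking through a `thinking` parameter. The sketch below assumes the `anthropic` Python SDK; the model name and token budgets are illustrative values, not recommendations.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=16000,                  # covers thinking tokens plus the visible answer
    thinking={"type": "enabled", "budget_tokens": 8000},  # cap on internal reasoning tokens
    messages=[{"role": "user", "content": "How many primes are there below 1000?"}],
)

# The API returns the reasoning as "thinking" blocks (billed as output tokens,
# though consumer products typically hide or summarize them) alongside the
# visible "text" answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```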

Scaling Laws for Inference

Research suggests that test-time compute follows its own scaling laws: doubling inference compute (thinking time) produces predictable improvements in accuracy, analogous to how doubling training compute improves pre-training loss. This means you can choose your quality-cost trade-off per query: simple questions get fast, cheap answers; complex questions get longer, more expensive reasoning. This dynamic allocation is more efficient than using the same compute for every query.
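One way to act on this trade-off is a router that assigns a thinking budget per query. The heuristic, thresholds, and budget numbers below are assumptions for illustration, not tuned values.

```python
def thinking_budget(query: str) -> int:
    """Crude difficulty heuristic: spend more inference compute on queries that
    look like multi-step reasoning, less on simple lookups.
    All markers, thresholds, and budgets here are illustrative assumptions."""
    hard_markers = ("prove", "debug", "step by step", "optimize", "why")
    if any(m in query.lower() for m in hard_markers) or len(query) > 500:
        return 8000   # long reasoning budget (thinking tokens)
    return 0          # answer directly, no extended thinking

# Quality becomes a per-query knob: same model, different compute.
for q in ["What is the capital of France?",
          "Prove that the sum of two odd numbers is even."]:
    print(q, "->", thinking_budget(q), "thinking tokens")
```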
