Infrastructure

Inference

The process of running a trained model to produce outputs. Training is learning; inference is using what was learned. Every time you send a prompt to Claude, or generate an image with Stable Diffusion, that's inference. Inference is what costs providers GPU-hours, and it's the part you pay for per token.

Why It Matters

The cost and speed of inference determine the economics of AI products. Faster inference = lower latency = better user experience. Cheaper inference = lower prices = wider adoption. The entire quantization and optimization industry exists to make inference more efficient.

Deep Dive

For large language models, inference happens in two distinct phases, and understanding them explains most of the performance characteristics you'll observe. The first phase is called "prefill" or "prompt processing" — the model reads your entire input prompt and builds up its internal state (the KV cache). This phase is compute-bound and benefits from GPU parallelism because all input tokens can be processed simultaneously. The second phase is "decode" or "generation" — the model produces output tokens one at a time, each one depending on all previous tokens. This phase is memory-bandwidth-bound because the model needs to read its weights from VRAM for each token but does relatively little computation per read. This is why Time to First Token (TTFT) and tokens-per-second are measured separately: they reflect fundamentally different bottlenecks.
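
The two phases can be illustrated with a toy single-head attention loop (a minimal sketch using NumPy, not a real transformer): prefill projects all prompt tokens into the KV cache in one parallel matrix multiply, while decode appends to the cache one token at a time, re-reading the whole cache at every step.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention of one query over all cached keys/values.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill: all 5 prompt tokens projected at once -- one big parallel
# matmul, which is why this phase is compute-bound.
prompt = rng.normal(size=(5, d))
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one token per step; every step must read the entire (growing)
# cache and the weights, which is why this phase is bandwidth-bound.
x = rng.normal(size=(d,))
for _ in range(3):
    x = attention(x @ Wq, K_cache, V_cache)
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])

print(K_cache.shape)  # cache grew by one row per generated token → (8, 8)
```

Note how prefill's cost is one batched operation over the whole prompt, while decode's cost scales with the number of output tokens: exactly the TTFT vs. tokens-per-second split described above.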

Throughput vs. Latency

The economics of inference are dominated by a concept called "throughput vs. latency." If you're serving a chatbot where one user is waiting for a response, you want low latency — get that first token out fast. But if you're running batch processing (summarizing 10,000 documents overnight), you want high throughput — process as many tokens per second as possible, even if each individual request is slower. Inference engines like vLLM and TensorRT-LLM use a technique called "continuous batching" to dynamically group multiple requests together, which dramatically improves throughput. A single H100 might generate 40 tokens/second for one request, but by batching cleverly, the same GPU can serve 20+ concurrent users at acceptable latency because the memory bandwidth is shared more efficiently.
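
A toy simulation makes the continuous-batching win concrete. The numbers below (request lengths, batch size) are made up for illustration: static batching holds every slot until the whole batch finishes, so short requests wait on long ones; continuous batching refills a slot the moment its request completes.

```python
# Each step, every active request emits one token. Request "lengths" are
# output-token counts -- illustrative values, not benchmarks.
lengths = [10, 50, 10, 50, 10, 50]
batch_size = 2

def static_batching_steps(lengths, batch_size):
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # batch waits for longest
    return steps

def continuous_batching_steps(lengths, batch_size):
    slots = [0] * batch_size          # remaining tokens per decode slot
    queue = list(lengths)
    steps = 0
    while queue or any(slots):
        for s in range(batch_size):   # refill freed slots immediately
            if slots[s] == 0 and queue:
                slots[s] = queue.pop(0)
        steps += 1
        slots = [max(0, r - 1) for r in slots]
    return steps

print(static_batching_steps(lengths, batch_size))      # → 150
print(continuous_batching_steps(lengths, batch_size))  # → 110
```

Same hardware, same requests, ~27% fewer decode steps, purely from not letting short requests idle behind long ones. Real engines like vLLM do this at the granularity of individual tokens, plus paged KV-cache memory management on top.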

The Serving Landscape

The inference serving landscape has splintered into distinct approaches. Cloud API providers (Anthropic, OpenAI, Google) run massive GPU clusters and sell inference as a service, priced per token. Inference-focused providers like Groq bet on custom hardware — Groq's LPU (Language Processing Unit) is specifically designed for the sequential decode phase and achieves remarkably fast token generation. On the open-source side, llama.cpp brought LLM inference to CPUs and consumer GPUs through aggressive quantization, and tools like Ollama wrapped it in a user-friendly package. For production self-hosting, vLLM with PagedAttention has become the default choice, offering throughput that rivals commercial offerings when tuned correctly.
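
The back-of-envelope arithmetic behind the quantization point is simple (weights only, ignoring KV cache and activations; figures approximate):

```python
# Weight memory for a 7B-parameter model at different precisions --
# the arithmetic behind why quantization puts LLMs on consumer hardware.
params = 7e9

def weight_gb(bits_per_param):
    return params * bits_per_param / 8 / 1e9

print(f"fp16: {weight_gb(16):.1f} GB")  # → fp16: 14.0 GB
print(f"int8: {weight_gb(8):.1f} GB")   # → int8: 7.0 GB
print(f"q4:   {weight_gb(4):.1f} GB")   # → q4:   3.5 GB
```

At fp16, a 7B model doesn't fit in a typical 8-12 GB consumer GPU; at 4-bit it fits comfortably, which is the gap llama.cpp-style quantization closes.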

The Cost Reality

A common misconception is that inference is "cheap" compared to training. For a single request, yes — generating a response costs a fraction of a cent. But inference is ongoing. A popular chatbot handles millions of requests per day, indefinitely. OpenAI reportedly spends more on inference than training at this point. This is why inference optimization is such a hot area: speculative decoding (using a small "draft" model to predict what the large model will say), KV cache compression, and prefix caching (reusing computation for shared system prompts) all aim to squeeze more responses out of the same hardware. Every percentage point of efficiency improvement translates directly into millions of dollars saved at scale.
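
Speculative decoding is easiest to see in a toy form. The sketch below uses hypothetical stand-in "models" (next token is a pure function of the last token), not real LLMs: the cheap draft proposes k tokens, the expensive target verifies them all in one batched pass, and we keep the longest agreeing prefix plus the target's correction.

```python
def target_model(tok):
    # Expensive model: defines the "correct" next token.
    return (tok * 3 + 1) % 10

def draft_model(tok):
    # Cheap approximation: agrees with the target except after a 7.
    return 0 if tok == 7 else (tok * 3 + 1) % 10

def speculative_decode(seed, n_tokens, k=4):
    out = [seed]
    target_passes = 0
    while len(out) < n_tokens + 1:
        # Draft proposes k tokens autoregressively (cheap, sequential).
        proposals, t = [], out[-1]
        for _ in range(k):
            t = draft_model(t)
            proposals.append(t)
        # Target checks all k proposals in one batched pass.
        target_passes += 1
        prev = out[-1]
        for p in proposals:
            expected = target_model(prev)
            if p == expected:
                out.append(p)
                prev = p
            else:
                out.append(expected)  # keep the target's correction
                break
    return out[1:n_tokens + 1], target_passes

tokens, passes = speculative_decode(seed=2, n_tokens=8)
print(tokens, passes)  # → [7, 2, 7, 2, 7, 2, 7, 2] 4
```

Here 8 tokens cost only 4 target-model passes instead of 8; in production the speedup depends on how often the draft agrees with the target, but no output quality is lost because every accepted token matches what the target would have produced.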
