The process of running a trained model to generate outputs. Training is learning; inference is using what was learned. Every time you send a prompt to Claude or generate an image with Stable Diffusion, that's inference. It's what costs providers GPU hours and what you pay for per token.
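The distinction is easiest to see in code. Below is a minimal, illustrative sketch (not any real model's API): the weights stand in for what training already produced, and inference is just a forward pass that reads them without ever updating them.

```python
import numpy as np

# Stand-in for a "trained" model: in practice these weights come from
# training; here they are fixed random values for illustration only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))   # "learned" weights, never modified below
b = np.zeros(2)

def infer(x):
    """Inference: one forward pass. No gradients, no weight updates."""
    logits = x @ W + b
    e = np.exp(logits - logits.max())   # softmax for probabilities
    return e / e.sum()

x = rng.standard_normal(4)   # one input, e.g. an embedded token
probs = infer(x)
print(probs)                 # the model's output for this input
```

Every prompt a provider serves is one or more of these forward passes at enormous scale, which is where the GPU hours go.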
Why it matters
Inference cost and speed determine the economics of AI products. Faster inference = lower latency = better UX. Cheaper inference = lower prices = wider adoption. An entire industry of optimization techniques, quantization chief among them, exists to make inference more efficient.
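To make the quantization point concrete, here is a toy sketch of one common scheme (per-tensor absmax quantization to int8). This is a simplification of what real libraries do, but it shows the core trade: a 4x smaller weight tensor in exchange for a small rounding error at inference time.

```python
import numpy as np

# Stand-in weights; a real model would have billions of these.
weights = np.random.default_rng(0).standard_normal(1000).astype(np.float32)

# Absmax quantization: map the largest-magnitude weight to the int8 range.
scale = np.abs(weights).max() / 127
q = np.round(weights / scale).astype(np.int8)      # stored form: 4x smaller

# At inference time, dequantize back to an approximation of the original.
dequantized = q.astype(np.float32) * scale

print(weights.nbytes, q.nbytes)              # 4000 vs 1000 bytes
print(np.abs(weights - dequantized).max())   # worst-case rounding error
```

Smaller weights mean less memory bandwidth per forward pass, which is usually the bottleneck in serving, so the savings show up directly as cheaper, faster inference.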