Infrastructure

Inference

The process of running a trained model to generate outputs. Training is learning; inference is using what was learned. Every time you send a prompt to Claude or generate an image with Stable Diffusion, that's inference. It's what costs providers GPU-hours and what you pay for by the token.

Why it matters

The cost and speed of inference determine the economics of AI products. Faster inference = lower latency = better UX. Cheaper inference = lower prices = wider adoption. The entire quantization and optimization industry exists to make inference more efficient.

Deep Dive

For large language models, inference happens in two distinct phases, and understanding them explains most of the performance characteristics you'll observe. The first phase is called "prefill" or "prompt processing" — the model reads your entire input prompt and builds up its internal state (the KV cache). This phase is compute-bound and benefits from GPU parallelism because all input tokens can be processed simultaneously. The second phase is "decode" or "generation" — the model produces output tokens one at a time, each one depending on all previous tokens. This phase is memory-bandwidth-bound because the model needs to read its weights from VRAM for each token but does relatively little computation per read. This is why Time to First Token (TTFT) and tokens-per-second are measured separately: they reflect fundamentally different bottlenecks.
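The two phases can be sketched with a toy autoregressive loop. This is illustrative only: the "model" here just emits position indices, and the KV cache is reduced to a plain token list rather than per-layer key/value tensors.

```python
def prefill(prompt_tokens):
    # Phase 1: read the whole prompt in one parallel pass and build the
    # KV cache. Compute-bound: all input tokens are processed at once.
    # (Toy cache: just the token list.)
    return list(prompt_tokens)

def decode_step(kv_cache):
    # Phase 2: emit ONE token conditioned on everything so far.
    # Memory-bandwidth-bound: each step re-reads the model weights.
    # Toy rule (assumption): the next token is the current sequence length.
    token = len(kv_cache)
    kv_cache.append(token)
    return token

def generate(prompt_tokens, max_new_tokens):
    kv_cache = prefill(prompt_tokens)        # cost of this call drives TTFT
    return [decode_step(kv_cache)            # this loop is strictly
            for _ in range(max_new_tokens)]  # sequential: tokens/second

print(generate([10, 11, 12], 4))  # → [3, 4, 5, 6]
```

Note that nothing in `generate` can parallelize the decode loop: token N is an input to token N+1, which is exactly why the two phases hit different hardware bottlenecks.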

Throughput vs. Latency

The economics of inference are dominated by the trade-off between throughput and latency. If you're serving a chatbot where one user is waiting for a response, you want low latency — get that first token out fast. But if you're running batch processing (summarizing 10,000 documents overnight), you want high throughput — process as many tokens per second as possible, even if each individual request is slower. Inference engines like vLLM and TensorRT-LLM use a technique called "continuous batching" to dynamically group multiple requests together, which dramatically improves throughput. A single H100 might generate 40 tokens/second for one request, but by batching cleverly, the same GPU can serve 20+ concurrent users at acceptable latency because the memory bandwidth is shared more efficiently.
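A back-of-the-envelope model shows why batching works. The numbers below are made-up round figures chosen so a lone request lands at 40 tokens/second, as above; the assumption is that one decode step reads the full weights from VRAM once no matter how many requests share it.

```python
def serve_metrics(batch_size, step_ms=25.0, extra_ms_per_req=2.0):
    # Assumption: the dominant per-step cost (step_ms) is the weight read,
    # amortized across the whole batch; each extra request adds only a
    # small compute overhead (extra_ms_per_req). All numbers are invented.
    batch_step_ms = step_ms + extra_ms_per_req * (batch_size - 1)
    per_user_tps = 1000.0 / batch_step_ms      # latency each user sees
    aggregate_tps = per_user_tps * batch_size  # total throughput of the GPU
    return per_user_tps, aggregate_tps

print(serve_metrics(1))   # one user:   40 tok/s for them, 40 tok/s total
print(serve_metrics(20))  # twenty users: each drops to ~16 tok/s,
                          # but the GPU delivers ~317 tok/s in aggregate
```

Under these toy numbers, batching 20 requests cuts per-user speed by ~60% but multiplies GPU throughput by ~8x — the asymmetry that continuous batching exploits.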

The Serving Landscape

The inference serving landscape has splintered into distinct approaches. Cloud API providers (Anthropic, OpenAI, Google) run massive GPU clusters and sell inference as a service, priced per token. Inference-focused providers like Groq bet on custom hardware — Groq's LPU (Language Processing Unit) is specifically designed for the sequential decode phase and achieves remarkably fast token generation. On the open-source side, llama.cpp brought LLM inference to CPUs and consumer GPUs through aggressive quantization, and tools like Ollama wrapped it in a user-friendly package. For production self-hosting, vLLM with PagedAttention has become the default choice, offering throughput that rivals commercial offerings when tuned correctly.

The Cost Reality

A common misconception is that inference is "cheap" compared to training. For a single request, yes — generating a response costs a fraction of a cent. But inference is ongoing. A popular chatbot handles millions of requests per day, indefinitely. OpenAI reportedly spends more on inference than training at this point. This is why inference optimization is such a hot area: speculative decoding (using a small "draft" model to predict what the large model will say), KV cache compression, and prefix caching (reusing computation for shared system prompts) all aim to squeeze more responses out of the same hardware. Every percentage point of efficiency improvement translates directly into millions of dollars saved at scale.
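Of those techniques, speculative decoding is the easiest to sketch. The toy below is greatly simplified: the "models" are stand-in functions, and acceptance is by exact match, whereas real speculative sampling uses a probabilistic acceptance rule so the output distribution matches the target model exactly.

```python
def target_next(seq):
    # Toy "large model": the true next token (here, the sequence length).
    return len(seq)

def draft_next(seq):
    # Toy "draft model": cheap, agrees with the target except at every
    # position divisible by 5, where it guesses wrong.
    return len(seq) if len(seq) % 5 != 0 else -1

def speculative_generate(prompt, n_tokens, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # The draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # The target model checks all k positions in ONE parallel pass
        # (expensive, but paid once per k drafted tokens).
        target_passes += 1
        for t in draft:
            if len(out) - len(prompt) >= n_tokens:
                break
            correct = target_next(out)
            out.append(correct)   # the target's token is always usable
            if t != correct:
                break             # first mismatch invalidates the rest
    return out[len(prompt):], target_passes

tokens, passes = speculative_generate([0, 1, 2], 12, k=4)
print(passes)  # → 4 expensive passes instead of 12
```

With a draft model that guesses right most of the time, 12 tokens cost 4 target passes instead of 12 — the same trade the production techniques make, just without the statistical machinery.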
