Zubnet AIApprendreWiki › Latency
Infrastructure

Latency

Also known as: Time to First Token (TTFT)
The delay between sending a request and receiving the first response. In AI, this is most often measured as Time to First Token (TTFT): how long before the model starts streaming its answer. It is affected by model size, server load, network distance, and prompt length.

Why It Matters

Users perceive anything over ~2 seconds as slow. Low latency is why small models often win for real-time applications even when bigger models are "smarter." It is a key differentiator between providers.

Deep Dive

Latency in AI systems breaks down into several distinct components, and understanding each one helps you diagnose what's actually slow. First there's network latency — the round-trip time for your request to reach the provider's server and for the first bytes of the response to come back. This is typically 20-100ms depending on your geographic distance from the datacenter. Then there's queue time — how long your request waits before a GPU is available to process it. During peak hours or for popular models, this can range from zero to several seconds. Next comes prefill time — the model processing your entire input prompt. For a 1,000-token prompt on a large model, this might take 200-500ms. Finally, decode begins and you get your first token. The total of all these stages is your TTFT (Time to First Token).
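The stages above can be sketched as a small measurement harness. The streaming source here is a simulated stand-in for any provider's token stream, and the helper names are illustrative, not from a specific SDK:

```python
import time

def measure_ttft(token_stream):
    """Measure time to first token (TTFT) and total time over any
    iterator of tokens. Returns (ttft_seconds, total_seconds, tokens)."""
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            # First token arrived: everything before this point is
            # network + queue + prefill, i.e. the TTFT.
            ttft = time.monotonic() - start
        tokens.append(tok)
    total = time.monotonic() - start
    return ttft, total, tokens

def fake_stream():
    """Simulated provider: ~50ms before the first token, then fast decode."""
    time.sleep(0.05)           # network + queue + prefill
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.005)      # inter-token (decode) latency
        yield tok
```

In a real application you would wrap your provider's streaming response in `measure_ttft` instead of `fake_stream()`; the decomposition is the same.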

Tokens Per Second

After the first token arrives, there's a second latency metric that matters just as much: inter-token latency, or how quickly subsequent tokens stream in. This is typically measured in tokens per second. GPT-4o might stream at 80-100 tokens/second, while Claude streams at similar speeds for most requests. For a chatbot, anything above about 30 tokens/second feels "instant" to a human reader — faster than you can read. Below 15 tokens/second, the streaming starts to feel choppy. This is why providers sometimes quote both TTFT and tokens/second — they're measuring different user experience bottlenecks. A response could start quickly but stream slowly, or take a moment to begin but then fly.
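Given TTFT and total time, decode throughput falls out of simple arithmetic; a minimal helper (names are illustrative):

```python
def tokens_per_second(n_tokens: int, ttft_s: float, total_s: float) -> float:
    """Decode throughput: tokens streamed after the first one, divided by
    the time spent streaming them (total time minus TTFT)."""
    if n_tokens < 2 or total_s <= ttft_s:
        return 0.0
    return (n_tokens - 1) / (total_s - ttft_s)

# A 101-token response with a 0.5s TTFT that finishes at 2.5s streams
# 100 tokens over 2 seconds -> 50 tokens/second.
rate = tokens_per_second(101, 0.5, 2.5)
```

This separation is exactly why quoting only one of the two metrics can mislead: a fast TTFT with a low tokens/second still feels sluggish on long answers.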

The Prompt Length Trap

Prompt length has a bigger impact on latency than most developers expect. The prefill phase scales roughly quadratically with input length for standard transformer models (thanks to self-attention), so a 10,000-token prompt doesn't just take 10x longer than a 1,000-token prompt — it can take significantly more. This is why providers like Anthropic charge differently for input vs. output tokens and why stuffing your entire codebase into a context window has real performance consequences. Techniques like prompt caching help enormously here: Anthropic's prompt caching feature lets you mark a portion of your prompt as cacheable, so if you're sending the same system prompt with every request (which most applications do), the prefill for that portion is essentially free after the first call.
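The quadratic growth shows up in the attention term alone; a back-of-the-envelope comparison (this deliberately ignores the linear per-token work, which softens the ratio in practice):

```python
def attention_cost(n_tokens: int) -> int:
    """Self-attention compares every input token with every other token,
    so prefill attention work grows as n^2 (constant factors dropped)."""
    return n_tokens * n_tokens

short = attention_cost(1_000)    # 1e6 pairwise comparisons
long = attention_cost(10_000)    # 1e8 pairwise comparisons
ratio = long / short             # 100x the attention work for 10x the tokens
```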

What to Watch For

The most common mistake developers make with latency is testing with short prompts during development and then being surprised by production performance. A 50-token test prompt responds in 300ms; the real production prompt with a system message, few-shot examples, and conversation history totaling 4,000 tokens responds in 2 seconds. The other gotcha is geographic routing — if your server is in Europe but you're calling a US-based API endpoint, you're adding 100-150ms of network latency to every single request. Some providers offer regional endpoints, and the smarter inference proxy services will route your traffic to the nearest datacenter automatically. For real-time applications like voice assistants, where total end-to-end latency needs to stay under 500ms, every one of these components matters and you end up optimizing all of them simultaneously.
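For a real-time pipeline, it helps to write the latency budget down explicitly. The component figures below are illustrative placeholders, not measurements from any provider:

```python
# Hypothetical end-to-end budget for a voice assistant (values in ms).
budget_ms = 500

components_ms = {
    "network_round_trip": 60,    # client <-> nearest regional endpoint
    "queue": 40,                 # waiting for an available GPU
    "prefill": 150,              # system prompt + conversation transcript
    "first_token_decode": 30,    # generating the first output token
    "text_to_speech_start": 120  # synthesizing the first audio chunk
}

total_ms = sum(components_ms.values())
headroom_ms = budget_ms - total_ms  # only 100ms spare in this sketch

# If any single component doubles (e.g. cross-continent routing adding
# ~100ms), the budget is blown -- hence optimizing every stage at once.
```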
