Infrastructure

Latency

Also known as: Time to First Token (TTFT)
The delay between sending a request and receiving the first response. In AI, this is usually measured as Time to First Token (TTFT): how long before the model begins streaming its answer. It is affected by model size, server load, network distance, and prompt length.

Why It Matters

Users perceive anything that takes more than about 2 seconds as slow. Low latency is why small models often win for real-time applications, even when a larger model is "smarter." It is a key point of differentiation between providers.

Deep Dive

Latency in AI systems breaks down into several distinct components, and understanding each one helps you diagnose what's actually slow. First there's network latency — the round-trip time for your request to reach the provider's server and for the first bytes of the response to come back. This is typically 20-100ms depending on your geographic distance from the datacenter. Then there's queue time — how long your request waits before a GPU is available to process it. During peak hours or for popular models, this can range from zero to several seconds. Next comes prefill time — the model processing your entire input prompt. For a 1,000-token prompt on a large model, this might take 200-500ms. Finally, decode begins and you get your first token. The total of all these stages is your TTFT (Time to First Token).
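As a rough illustration (the function names and delay here are hypothetical, not any provider's API), TTFT can be measured by timing the gap between issuing a request and receiving the first streamed token:

```python
import time

def measure_ttft(token_stream):
    """Return (first_token, seconds_to_first_token) for an iterator of tokens."""
    start = time.perf_counter()
    first = next(token_stream)  # blocks through network + queue + prefill
    return first, time.perf_counter() - start

def fake_provider_stream(prefill_s=0.05):
    """Simulated stream: one fixed delay stands in for network, queue, and prefill."""
    time.sleep(prefill_s)
    yield from ["Hello", ",", " world"]

first_token, ttft = measure_ttft(fake_provider_stream())
```

In a real client you would wrap the provider's streaming iterator the same way; the point is that TTFT includes every stage before decode, not just model compute.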

Tokens Per Second

After the first token arrives, there's a second latency metric that matters just as much: inter-token latency, or how quickly subsequent tokens stream in. This is typically measured in tokens per second. GPT-4o might stream at 80-100 tokens/second, while Claude streams at similar speeds for most requests. For a chatbot, anything above about 30 tokens/second feels "instant" to a human reader — faster than you can read. Below 15 tokens/second, the streaming starts to feel choppy. This is why providers sometimes quote both TTFT and tokens/second — they're measuring different user experience bottlenecks. A response could start quickly but stream slowly, or take a moment to begin but then fly.
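The thresholds above can be captured in a small helper. The 30 and 15 tokens/second cut-offs come from the rule of thumb in this section; they are perceptual estimates, not a specification:

```python
def tokens_per_second(n_tokens: int, stream_seconds: float) -> float:
    """Streaming throughput after the first token arrives."""
    return n_tokens / stream_seconds

def streaming_feel(tps: float) -> str:
    """Classify perceived streaming speed using rule-of-thumb thresholds."""
    if tps >= 30:
        return "instant"   # faster than a human reads
    if tps >= 15:
        return "readable"
    return "choppy"

# 400 tokens streamed over 5 seconds -> 80 tokens/second
rate = tokens_per_second(400, 5.0)
```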

The Prompt Length Trap

Prompt length has a bigger impact on latency than most developers expect. The prefill phase scales roughly quadratically with input length for standard transformer models (thanks to self-attention), so a 10,000-token prompt doesn't just take 10x longer than a 1,000-token prompt — it can take significantly more. This is why providers like Anthropic charge differently for input vs. output tokens and why stuffing your entire codebase into a context window has real performance consequences. Techniques like prompt caching help enormously here: Anthropic's prompt caching feature lets you mark a portion of your prompt as cacheable, so if you're sending the same system prompt with every request (which most applications do), the prefill for that portion is essentially free after the first call.
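A toy cost model makes the quadratic scaling concrete. The assumption that a cached prefix is entirely free is a simplification for illustration, not Anthropic's actual billing or runtime behavior:

```python
def prefill_attention_cost(prompt_tokens: int, cached_tokens: int = 0) -> float:
    """Relative self-attention cost of prefill, ~O(n^2) in uncached tokens.

    Simplification: a cached prefix is treated as free, and attention
    cost is counted only over the uncached remainder.
    """
    uncached = max(prompt_tokens - cached_tokens, 0)
    return float(uncached ** 2)

# A 10x longer prompt costs ~100x in attention compute, not 10x:
ratio = prefill_attention_cost(10_000) / prefill_attention_cost(1_000)

# Caching a 3,500-token system prompt out of a 4,000-token request
# leaves only the 500-token remainder to prefill:
saved = prefill_attention_cost(4_000) - prefill_attention_cost(4_000, cached_tokens=3_500)
```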

What to Watch For

The most common mistake developers make with latency is testing with short prompts during development and then being surprised by production performance. A 50-token test prompt responds in 300ms; the real production prompt with a system message, few-shot examples, and conversation history totaling 4,000 tokens responds in 2 seconds. The other gotcha is geographic routing — if your server is in Europe but you're calling a US-based API endpoint, you're adding 100-150ms of network latency to every single request. Some providers offer regional endpoints, and the smarter inference proxy services will route your traffic to the nearest datacenter automatically. For real-time applications like voice assistants, where total end-to-end latency needs to stay under 500ms, every one of these components matters and you end up optimizing all of them simultaneously.
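For a latency target like the voice-assistant case, it helps to add the stages up explicitly. The numbers below are illustrative placeholders, not measurements:

```python
def total_latency_ms(stages: dict) -> float:
    """End-to-end TTFT estimate as the sum of per-stage latencies (ms)."""
    return sum(stages.values())

# Hypothetical budget for a real-time voice pipeline (target: under 500 ms)
budget = {
    "network_rtt": 80.0,   # client <-> nearest regional endpoint
    "queue": 50.0,         # waiting for a free GPU
    "prefill": 250.0,      # processing the input prompt
    "first_decode": 40.0,  # generating the first token
}

ttft_estimate = total_latency_ms(budget)  # 420.0 ms, inside the 500 ms target
```

Writing the budget out this way makes it obvious which stage to attack first: here prefill dominates, so prompt caching or a shorter prompt buys far more than shaving network hops.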
