Infrastructure

Latency

Also known as: Time to First Token (TTFT)
The delay between sending a request and receiving the first response. In AI this is usually measured as Time to First Token (TTFT): how long before the model begins streaming its answer. It is affected by model size, server load, network distance, and prompt length.

Why It Matters

Users perceive anything over about 2 seconds as slow. Low latency is why small models often win for real-time applications, even when a larger model is "smarter." It is a key differentiator between providers.

Deep Dive

Latency in AI systems breaks down into several distinct components, and understanding each one helps you diagnose what's actually slow. First there's network latency — the round-trip time for your request to reach the provider's server and for the first bytes of the response to come back. This is typically 20-100ms depending on your geographic distance from the datacenter. Then there's queue time — how long your request waits before a GPU is available to process it. During peak hours or for popular models, this can range from zero to several seconds. Next comes prefill time — the model processing your entire input prompt. For a 1,000-token prompt on a large model, this might take 200-500ms. Finally, decode begins and you get your first token. The total of all these stages is your TTFT (Time to First Token).
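The TTFT described above can be measured directly on any streaming response by timing how long the first token takes to arrive. A minimal sketch, assuming a generic token iterable; real provider SDKs yield chunk objects, but the timing logic is the same, and `fake_stream` here is a hypothetical stand-in:

```python
import time

def measure_ttft(token_stream):
    """Return (first_token, seconds_elapsed) for an iterable of streamed tokens."""
    start = time.perf_counter()
    # next() blocks through network latency + queue time + prefill,
    # so the elapsed time is the full TTFT.
    first_token = next(iter(token_stream))
    return first_token, time.perf_counter() - start

def fake_stream(prefill_seconds=0.05):
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(prefill_seconds)  # simulate network + queue + prefill
    yield "Hello"
    yield ","
    yield " world"
```

Running `measure_ttft(fake_stream())` reports roughly the simulated 50ms delay; against a live API it would capture all four stages at once.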

Tokens Per Second

After the first token arrives, there's a second latency metric that matters just as much: inter-token latency, or how quickly subsequent tokens stream in. This is typically measured in tokens per second. GPT-4o might stream at 80-100 tokens/second, while Claude streams at similar speeds for most requests. For a chatbot, anything above about 30 tokens/second feels "instant" to a human reader — faster than you can read. Below 15 tokens/second, the streaming starts to feel choppy. This is why providers sometimes quote both TTFT and tokens/second — they're measuring different user experience bottlenecks. A response could start quickly but stream slowly, or take a moment to begin but then fly.
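Inter-token latency is easy to compute if you record an arrival timestamp for each streamed token. A minimal sketch (the helper name is an assumption, not a provider API):

```python
def tokens_per_second(arrival_times):
    """Compute streaming throughput from per-token arrival timestamps (in seconds)."""
    if len(arrival_times) < 2:
        return 0.0  # need at least two tokens to measure an interval
    elapsed = arrival_times[-1] - arrival_times[0]
    # N timens define N-1 inter-token intervals.
    return (len(arrival_times) - 1) / elapsed
```

For example, six tokens arriving 20ms apart yields 50 tokens/second — comfortably in the "instant" range above 30 tokens/second.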

The Prompt Length Trap

Prompt length has a bigger impact on latency than most developers expect. The prefill phase scales roughly quadratically with input length for standard transformer models (thanks to self-attention), so a 10,000-token prompt doesn't just take 10x longer than a 1,000-token prompt — it can take significantly more. This is why providers like Anthropic charge differently for input vs. output tokens and why stuffing your entire codebase into a context window has real performance consequences. Techniques like prompt caching help enormously here: Anthropic's prompt caching feature lets you mark a portion of your prompt as cacheable, so if you're sending the same system prompt with every request (which most applications do), the prefill for that portion is essentially free after the first call.
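The quadratic scaling of prefill can be made concrete with back-of-the-envelope arithmetic. This is an illustrative model only (real prefill also has linear terms, and caching changes the picture entirely), not a provider's pricing or performance formula:

```python
def relative_prefill_cost(prompt_tokens, baseline_tokens=1_000):
    """Rough attention-compute scaling: self-attention FLOPs grow ~quadratically
    with prompt length, so cost relative to a baseline prompt is (n/b)**2.
    Illustrative only -- ignores the linear feed-forward terms."""
    return (prompt_tokens / baseline_tokens) ** 2
```

Under this model, a 10,000-token prompt costs roughly 100x the attention compute of a 1,000-token prompt, not 10x — which is the "prompt length trap" in numbers.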

What to Watch For

The most common mistake developers make with latency is testing with short prompts during development and then being surprised by production performance. A 50-token test prompt responds in 300ms; the real production prompt with a system message, few-shot examples, and conversation history totaling 4,000 tokens responds in 2 seconds. The other gotcha is geographic routing — if your server is in Europe but you're calling a US-based API endpoint, you're adding 100-150ms of network latency to every single request. Some providers offer regional endpoints, and the smarter inference proxy services will route your traffic to the nearest datacenter automatically. For real-time applications like voice assistants, where total end-to-end latency needs to stay under 500ms, every one of these components matters and you end up optimizing all of them simultaneously.
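For real-time use cases like the voice assistant mentioned above, it helps to write the latency budget down explicitly and check each stage against it. A sketch with purely hypothetical per-stage numbers (every figure here is an assumption, not a measured benchmark):

```python
def check_budget(components_ms, total_budget_ms=500):
    """Sum per-stage latencies and report (total, whether it fits the budget)."""
    total = sum(components_ms.values())
    return total, total <= total_budget_ms

# Hypothetical stage estimates for a voice assistant, in milliseconds.
voice_budget = {
    "network": 80,       # round trip to a non-regional endpoint
    "queue": 50,         # wait for a free GPU
    "prefill": 200,      # 4,000-token production prompt
    "first_decode": 40,  # first token out
}
```

Summing the stages (370ms here) shows how little headroom remains under 500ms — and why shaving the network component with a regional endpoint, or the prefill with prompt caching, matters so much.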

Related Concepts

← All Terms
← Large Language Model | Layer →