Infrastructure

Latency

Also known as: Time to First Token (TTFT)
The delay between sending a request and receiving the first piece of the response. In AI this is usually measured as Time to First Token (TTFT): how long before the model starts streaming its reply. It is affected by model size, server load, network distance, and prompt length.

Why it matters

Users perceive anything above ~2 seconds as slow. Low latency is why smaller models often win for real-time applications even when larger models are "smarter". It is a key differentiator between providers.

Deep Dive

Latency in AI systems breaks down into several distinct components, and understanding each one helps you diagnose what's actually slow. First there's network latency — the round-trip time for your request to reach the provider's server and for the first bytes of the response to come back. This is typically 20-100ms depending on your geographic distance from the datacenter. Then there's queue time — how long your request waits before a GPU is available to process it. During peak hours or for popular models, this can range from zero to several seconds. Next comes prefill time — the model processing your entire input prompt. For a 1,000-token prompt on a large model, this might take 200-500ms. Finally, decode begins and you get your first token. The total of all these stages is your TTFT (Time to First Token).
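The stages above can be sketched as a small measurement helper. Everything here is illustrative: `fake_stream` is a stand-in for whatever streaming client you actually use, and the sleep durations are made-up numbers mimicking the network + queue + prefill delay before the first token.

```python
import time

def measure_ttft(stream):
    """Time from issuing the request to receiving the first token.

    `stream` is any iterator that yields response tokens as they
    arrive -- a stand-in for a real streaming client.
    """
    start = time.monotonic()
    first_token = next(stream)
    ttft = time.monotonic() - start
    return first_token, ttft

def fake_stream():
    """Simulated server: one combined delay before the first token
    (network + queue + prefill), then fast decoding."""
    time.sleep(0.3)           # network + queue + prefill, combined
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.01)      # inter-token decode time
        yield tok

token, ttft = measure_ttft(fake_stream())
print(f"first token: {token!r}, TTFT: {ttft:.2f}s")
```

The key design point is that TTFT is measured up to the first `next()` call only; everything after that belongs to the streaming-rate metric discussed next.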

Tokens Per Second

After the first token arrives, there's a second latency metric that matters just as much: inter-token latency, or how quickly subsequent tokens stream in. This is typically measured in tokens per second. GPT-4o might stream at 80-100 tokens/second, while Claude streams at similar speeds for most requests. For a chatbot, anything above about 30 tokens/second feels "instant" to a human reader — faster than you can read. Below 15 tokens/second, the streaming starts to feel choppy. This is why providers sometimes quote both TTFT and tokens/second — they're measuring different user experience bottlenecks. A response could start quickly but stream slowly, or take a moment to begin but then fly.
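That arithmetic is simple but easy to get off-by-one wrong: the rate is tokens *after* the first, divided by time *since* the first. A minimal sketch, with illustrative timestamps:

```python
def tokens_per_second(timestamps):
    """Streaming rate after the first token.

    `timestamps` are arrival times (in seconds) of each token,
    starting with the first. Rate = tokens after the first,
    divided by the elapsed time since the first token arrived.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    elapsed = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / elapsed

# 101 tokens arriving every 12.5 ms -> about 80 tokens/second
stamps = [i * 0.0125 for i in range(101)]
print(tokens_per_second(stamps))
```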

The Prompt Length Trap

Prompt length has a bigger impact on latency than most developers expect. The prefill phase scales roughly quadratically with input length for standard transformer models (thanks to self-attention), so a 10,000-token prompt doesn't just take 10x longer than a 1,000-token prompt — it can take significantly more. This is why providers like Anthropic charge differently for input vs. output tokens and why stuffing your entire codebase into a context window has real performance consequences. Techniques like prompt caching help enormously here: Anthropic's prompt caching feature lets you mark a portion of your prompt as cacheable, so if you're sending the same system prompt with every request (which most applications do), the prefill for that portion is essentially free after the first call.
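The 10x-versus-much-more point can be made concrete with a back-of-envelope model. `prefill_estimate` below is a hypothetical helper assuming a *purely* quadratic attention cost; real systems mix linear and quadratic terms, so treat this as an upper-bound intuition rather than a prediction.

```python
def prefill_estimate(n_tokens, base_tokens=1_000, base_ms=300):
    """Rough prefill-time estimate under a purely quadratic
    self-attention cost model: t(n) = t(base) * (n / base)^2.
    Numbers are illustrative, not benchmarks.
    """
    return base_ms * (n_tokens / base_tokens) ** 2

# If 1,000 tokens of prefill takes ~300 ms, a naive quadratic model
# puts 10,000 tokens at 100x that cost, not 10x:
print(prefill_estimate(10_000))  # 30000.0 ms
```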

What to Watch For

The most common mistake developers make with latency is testing with short prompts during development and then being surprised by production performance. A 50-token test prompt responds in 300ms; the real production prompt with a system message, few-shot examples, and conversation history totaling 4,000 tokens responds in 2 seconds. The other gotcha is geographic routing — if your server is in Europe but you're calling a US-based API endpoint, you're adding 100-150ms of network latency to every single request. Some providers offer regional endpoints, and the smarter inference proxy services will route your traffic to the nearest datacenter automatically. For real-time applications like voice assistants, where total end-to-end latency needs to stay under 500ms, every one of these components matters and you end up optimizing all of them simultaneously.
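An end-to-end budget like the 500ms voice-assistant target is just component arithmetic, but writing it down per stage makes it obvious where the headroom goes. The stage names and numbers below are illustrative, not measured:

```python
def check_budget(components_ms, budget_ms=500):
    """Sum per-stage latencies and report the remaining headroom
    against an end-to-end budget (all values in milliseconds)."""
    total = sum(components_ms.values())
    return total, budget_ms - total

# Illustrative numbers for a Europe-based server calling a US endpoint:
stages = {
    "network (EU -> US round trip)": 130,
    "queue": 50,
    "prefill (4,000-token prompt)": 250,
    "first-token decode": 40,
}
total, headroom = check_budget(stages)
print(total, headroom)  # 470 30
```

With only 30ms of slack, any one stage regressing blows the budget, which is why real-time systems end up optimizing every component at once.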
