Infrastructure

Throughput

Tokens Per Second (TPS)
The total number of tokens a system can generate per second across all concurrent requests. Not the same as latency (how quickly a single request is served). A high-throughput system serves many users simultaneously; a low-latency system serves each individual user quickly. The two frequently trade off against each other.

Why It Matters

When you build an AI product, throughput determines your serving cost and capacity. A system that generates 100 tokens per second per user but can only serve one user at a time has low throughput, even though its individual latency is excellent. When you're paying GPU bills to serve thousands of concurrent users, throughput is what you optimize for.

Deep Dive

The distinction matters most in production. Latency (particularly TTFT — time to first token) determines user experience for a single request. Throughput determines how many users you can serve with a given number of GPUs. Techniques that improve one often hurt the other: batching many requests together improves throughput (the GPU stays busy) but increases latency (each request waits for the batch).
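
To make the tradeoff concrete, here is a toy model. The function step_time_ms and its constants are invented for illustration; the only property borrowed from real systems is that, while decoding is memory-bandwidth-bound, one step over a batch costs only slightly more than one step over a single request.

```python
# Toy model of the batching tradeoff. All numbers are hypothetical.

def step_time_ms(batch_size):
    """Hypothetical time for one decode step at a given batch size."""
    return 20.0 + 0.5 * batch_size  # fixed overhead + small per-request cost

for batch in (1, 8, 32):
    t = step_time_ms(batch)
    total = batch * 1000.0 / t    # tokens/sec across all requests (throughput)
    per_req = 1000.0 / t          # tokens/sec seen by one request (latency side)
    print(f"batch={batch:2d}  step={t:4.1f} ms  "
          f"throughput={total:6.1f} tok/s  per-request={per_req:4.1f} tok/s")
```

In this toy model, going from batch size 1 to 32 raises total throughput roughly 18x while each individual request slows from about 49 to about 28 tokens/second: the tradeoff in miniature.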

Continuous Batching

The breakthrough in LLM serving was continuous batching (also called in-flight batching). Instead of waiting for all requests in a batch to finish before starting new ones, continuous batching adds new requests to the batch as slots open up. This keeps GPU utilization high and prevents short requests from being held up by long ones. vLLM, TGI, and TensorRT-LLM all implement this.
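
A minimal sketch of the scheduling idea follows. MAX_BATCH, the request dicts, and decode_step are hypothetical stand-ins; real engines like vLLM, TGI, and TensorRT-LLM also schedule prefill, manage KV-cache memory, and handle preemption.

```python
from collections import deque

MAX_BATCH = 4  # hypothetical slot count

waiting = deque(
    {"id": i, "max_tokens": n, "generated": 0}
    for i, n in enumerate([3, 10, 2, 8, 5, 1])
)
running = []

def decode_step(batch):
    """Stand-in for one batched forward pass: every request gains one token."""
    for req in batch:
        req["generated"] += 1

step = 0
while waiting or running:
    # The key idea: refill open slots on every step. Static batching would
    # instead wait here until the whole batch had finished.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    decode_step(running)
    step += 1

    # A finished request frees its slot immediately, so a short request is
    # never held up behind a long one.
    for req in running:
        if req["generated"] >= req["max_tokens"]:
            print(f"step {step:2d}: request {req['id']} finished")
    running = [r for r in running if r["generated"] < r["max_tokens"]]
```

The one-line difference from static batching is the refill inside the loop: slots are reclaimed and refilled after every decode step, not after every batch.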

The Economics

At scale, throughput directly determines cost per token. A server renting for $10/hour while generating 10,000 tokens/second produces 36 million tokens per hour, roughly $0.00028 per 1,000 tokens. The same server at 1,000 tokens/second costs roughly $0.0028 per 1,000 tokens. This 10x difference is why inference optimization (quantization, speculative decoding, better batching) matters so much: it's not just faster, it's cheaper. Providers who optimize throughput can offer lower prices or higher margins.
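
The same arithmetic made explicit. The helper cost_per_1k_tokens is hypothetical, and the $10/hour rate and throughput figures are the example above, not quoted market prices.

```python
def cost_per_1k_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Dollars per 1,000 generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000

for tps in (10_000, 1_000):
    print(f"{tps:6,d} tok/s -> ${cost_per_1k_tokens(10.0, tps):.5f} per 1K tokens")
# 10,000 tok/s -> $0.00028 per 1K tokens
#  1,000 tok/s -> $0.00278 per 1K tokens
```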
