Infrastructure

Throughput

Tokens Per Second, TPS
The total number of tokens a system can generate per second across all concurrent requests. Distinct from latency, which measures how quickly a single request is served. A high-throughput system serves many users at once; a low-latency system serves each individual user quickly. The two frequently trade off against each other.

Why It Matters

When you're building an AI product, throughput determines your serving cost and capacity. A system that generates 100 tokens per second per user but can serve only one user at a time has low throughput, even if its per-user latency is excellent. When you're paying GPU bills to serve thousands of concurrent users, throughput is what you optimize.

Deep Dive

The distinction matters most in production. Latency (particularly TTFT — time to first token) determines user experience for a single request. Throughput determines how many users you can serve with a given number of GPUs. Techniques that improve one often hurt the other: batching many requests together improves throughput (the GPU stays busy) but increases latency (each request waits for the batch).
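To make the trade-off concrete, here is a toy model of static batching. All the constants (step time, per-request overhead, output length) are illustrative assumptions, not measurements:

    # Toy model: bigger batches raise aggregate throughput because the GPU
    # amortizes each decode step over more requests, but every request now
    # takes longer to finish. All numbers are assumptions for illustration.

    def batch_stats(batch_size: int,
                    step_time_ms: float = 20.0,   # assumed base decode step
                    overhead_ms: float = 2.0,     # assumed cost per extra request
                    tokens_per_request: int = 100):
        # Assume each decode step slows slightly as the batch grows.
        step_ms = step_time_ms + overhead_ms * (batch_size - 1)
        latency_s = tokens_per_request * step_ms / 1000   # one request's wall time
        throughput = batch_size * tokens_per_request / latency_s
        return throughput, latency_s

    for bs in (1, 4, 16, 64):
        tps, lat = batch_stats(bs)
        print(f"batch={bs:3d}  throughput={tps:6.0f} tok/s  latency={lat:5.2f} s")

Under these made-up constants, throughput climbs from 50 to roughly 440 tokens/second as the batch grows, while per-request latency stretches from 2 to almost 15 seconds: the same hardware, tuned for opposite goals.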

Continuous Batching

The breakthrough in LLM serving was continuous batching (also called in-flight batching). Instead of waiting for all requests in a batch to finish before starting new ones, continuous batching adds new requests to the batch as slots open up. This keeps GPU utilization high and prevents short requests from being held up by long ones. vLLM, TGI, and TensorRT-LLM all implement this.
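The sketch below captures the scheduling idea in a few lines of Python. It is a simplified simulation of the loop, not any engine's real API; the slot count and output lengths are made-up numbers:

    import random
    from collections import deque

    MAX_SLOTS = 4  # assumed number of sequences the GPU can decode at once

    waiting = deque(f"req-{i}" for i in range(8))  # queued requests
    running = {}  # request id -> tokens left to generate

    step = 0
    while waiting or running:
        # Refill open slots every step instead of waiting for the whole
        # batch to drain -- this is the core of continuous batching.
        while waiting and len(running) < MAX_SLOTS:
            running[waiting.popleft()] = random.randint(2, 6)  # pretend length

        # One decode step: every running sequence emits one token.
        for req in list(running):
            running[req] -= 1
            if running[req] == 0:
                del running[req]  # finished; its slot frees up immediately

        step += 1
        print(f"step {step:2d}: running={len(running)}, waiting={len(waiting)}")

A static batcher would instead wait until running emptied before admitting new requests, leaving slots idle whenever short requests finished early.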

The Economics

At scale, throughput directly determines cost per token. A server generating 10,000 tokens/second at $10/hour produces 36 million tokens per hour, which works out to roughly $0.00028 per 1,000 tokens. The same server at 1,000 tokens/second costs about $0.0028 per 1,000. That 10x difference is why inference optimization (quantization, speculative decoding, better batching) matters so much: it's not just faster, it's cheaper. Providers who optimize throughput can offer lower prices or higher margins.
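Spelled out as code, using the same $10/hour figure from above:

    def cost_per_1k_tokens(tokens_per_second: float, dollars_per_hour: float) -> float:
        tokens_per_hour = tokens_per_second * 3600
        return dollars_per_hour / tokens_per_hour * 1000

    for tps in (10_000, 1_000):
        print(f"{tps:>6} tok/s at $10/hr -> ${cost_per_1k_tokens(tps, 10):.5f} per 1K tokens")

This prints $0.00028 and $0.00278 per 1,000 tokens respectively.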
