Zubnet AI Learning Wiki › Streaming

Streaming

Server-Sent Events, Token Streaming
Sending the model's output to the user token by token as it is generated, rather than waiting for the complete response. Streaming uses Server-Sent Events (SSE) over HTTP: the connection stays open, and the server pushes each new token as a small event. This is why text appears word by word in chat interfaces.

Why It Matters

Streaming changes the user experience. A response that takes 10 seconds feels acceptable when you watch it build word by word; the same response delivered all at once after 10 seconds of blank screen feels broken. Streaming also lets users interrupt a bad response early, saving tokens and money.

Deep Dive

Technically, streaming uses the stream: true parameter in API calls. The server responds with a stream of SSE events, each containing one or a few tokens plus metadata (like token counts, stop reason). The client reads these events incrementally and renders them. Most SDKs handle the SSE parsing for you, but understanding the underlying mechanism helps when debugging latency issues or building custom streaming UIs.
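To make the underlying mechanism concrete, here is a minimal sketch of the SSE parsing an SDK does for you. It assumes the common OpenAI-style convention where each event is a `data: {...}` line carrying a JSON delta and the stream ends with a `data: [DONE]` sentinel; exact field names vary by provider, and the response body here is simulated rather than fetched over HTTP.

```python
import json

# A simulated SSE response body in the OpenAI-style "data: ..." convention
# (field names are an assumption -- other providers use different shapes).
raw_sse = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)

def parse_sse(body: str):
    """Yield the text content of each token event from an SSE body."""
    for line in body.split("\n"):
        line = line.strip()
        if not line.startswith("data: "):
            continue  # SSE also allows comment lines and other fields
        payload = line[len("data: "):]
        if payload == "[DONE]":  # sentinel marking the end of the stream
            return
        event = json.loads(payload)
        delta = event["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

print("".join(parse_sse(raw_sse)))  # prints "Hello"
```

A real client would read these events incrementally off the socket and render each yielded fragment immediately, which is exactly the word-by-word effect you see in chat UIs.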

Streaming Affects Architecture

Streaming isn't just a UI feature — it affects how you build applications. With streaming, you can't post-process the complete response before showing it (since it's not complete yet). If you need to validate, filter, or transform the response, you either process it in chunks (harder) or buffer the full response and show it after (defeating the purpose). Tools like function calling also interact with streaming: the model might stream a tool call, then pause while your code executes the tool, then resume streaming the final answer.

Time to First Token

In a streaming context, the key latency metric is TTFT (Time to First Token) — how long before the first token appears. This depends on prompt processing time (longer prompts take longer to process before generation starts) and server load. TTFT of under 500ms feels instant; over 2 seconds feels sluggish. After the first token, inter-token latency (the gap between successive tokens) determines how smooth the stream looks. Most providers achieve 20–50ms inter-token latency, which looks natural.
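Both metrics are easy to measure client-side with a monotonic clock: TTFT is the delay before the first token arrives, and inter-token latency is the average gap between the rest. A minimal sketch, using a simulated token stream in place of a real API call:

```python
import time

def measure_stream(token_iter):
    """Return (TTFT, average inter-token gap) in seconds for a token stream."""
    start = time.monotonic()
    ttft = None
    gaps, last = [], None
    for _token in token_iter:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # time to first token
        else:
            gaps.append(now - last)     # gap between successive tokens
        last = now
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, avg_gap

def fake_stream():
    """Simulated stream: slow first token, then fast follow-up tokens."""
    time.sleep(0.05)  # stands in for prompt processing before generation
    yield "Hello"
    for token in [" world", "!"]:
        time.sleep(0.01)  # stands in for inter-token latency
        yield token

ttft, avg_gap = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, avg inter-token gap: {avg_gap * 1000:.0f}ms")
```

Measured this way, a long prompt shows up purely as increased TTFT, while a loaded or slow inference server shows up in the inter-token gaps — which makes the two numbers useful for diagnosing where latency actually comes from.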
