Zubnet AI學習Wiki › Streaming
Using AI

Streaming

Server-Sent Events, Token Streaming
Model output is sent to the user token by token as it is generated, rather than waiting for the complete response. Streaming uses Server-Sent Events (SSE) over HTTP: the connection stays open, and the server pushes each new token as a small event. This is why text appears word by word in chat interfaces.
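On the wire, each SSE event is a "data:" line followed by a blank line. An illustrative sketch of what an OpenAI-style token stream might look like (field names vary by provider; the payload shape here is an assumption):

```
data: {"choices":[{"delta":{"content":"Hel"}}]}

data: {"choices":[{"delta":{"content":"lo"}}]}

data: [DONE]
```

Each event carries a small fragment of text; the client concatenates the fragments as they arrive.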

Why It Matters

Streaming changes the user experience. A response that takes 10 seconds feels acceptable when you watch it build word by word; the same response delivered all at once after 10 seconds of blank screen feels broken. Streaming also lets users interrupt a bad response early, saving tokens and money.

Deep Dive

Technically, streaming is enabled with the stream: true parameter in API calls. The server responds with a stream of SSE events, each containing one or a few tokens plus metadata (such as token counts or the stop reason). The client reads these events incrementally and renders them. Most SDKs handle the SSE parsing for you, but understanding the underlying mechanism helps when debugging latency issues or building custom streaming UIs.
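A minimal sketch of the parsing an SDK does for you. It assumes an OpenAI-style event format (a "data:" line per event, text fragments under choices[0].delta.content, and a "[DONE]" sentinel); real providers differ in the details:

```python
import json

def parse_sse_tokens(raw: str):
    """Extract text tokens from a buffer of SSE events.

    Assumes OpenAI-style "data: <json>" lines ending with
    "data: [DONE]" -- an illustrative convention, not every
    provider's exact format.
    """
    tokens = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        delta = event["choices"][0]["delta"].get("content")
        if delta is not None:
            tokens.append(delta)
    return tokens

raw = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
    'data: [DONE]\n'
)
print("".join(parse_sse_tokens(raw)))  # → Hello
```

In a real client the buffer arrives in network chunks, so the parser must also handle events split across chunk boundaries, which is exactly the bookkeeping SDKs hide.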

Streaming Affects Architecture

Streaming isn't just a UI feature — it affects how you build applications. With streaming, you can't post-process the complete response before showing it (since it's not complete yet). If you need to validate, filter, or transform the response, you either process it in chunks (harder) or buffer the full response and show it after (defeating the purpose). Tools like function calling also interact with streaming: the model might stream a tool call, then pause while your code executes the tool, then resume streaming the final answer.
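To see why chunk-wise processing is harder, consider redacting a banned word from a stream: the word can be split across chunks, so the filter must hold back a small tail buffer before emitting. A minimal sketch (the function name and redaction rule are hypothetical):

```python
def redact_stream(chunks, banned="secret"):
    """Redact a banned word from a token stream incrementally.

    Hypothetical example: because the word may span chunk
    boundaries, we withhold a tail of len(banned)-1 characters
    until the next chunk arrives.
    """
    buf = ""
    for chunk in chunks:
        buf += chunk
        buf = buf.replace(banned, "*" * len(banned))
        # Everything except the tail is safe to emit: the tail
        # could still be a prefix of the banned word.
        safe = len(buf) - (len(banned) - 1)
        if safe > 0:
            yield buf[:safe]
            buf = buf[safe:]
    # Flush whatever remains once the stream ends.
    yield buf.replace(banned, "*" * len(banned))

print("".join(redact_stream(["the sec", "ret plan"])))  # → the ****** plan
```

A buffered approach would just be `"".join(chunks).replace(banned, ...)` — simpler, but the user sees nothing until the stream ends.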

Time to First Token

In a streaming context, the key latency metric is TTFT (Time to First Token) — how long before the first token appears. This depends on prompt processing time (longer prompts take longer to process before generation starts) and server load. TTFT of under 500ms feels instant; over 2 seconds feels sluggish. After the first token, inter-token latency (the gap between successive tokens) determines how smooth the stream looks. Most providers achieve 20–50ms inter-token latency, which looks natural.

Related Concepts

← All Terms
← Stochastic Parrot | Structured Output →