Infrastructure

A/B Testing for AI

Online Evaluation, Split Testing
Compares two variants of an AI system (different models, prompts, or configurations) by randomly assigning real users to each variant and measuring which performs better on the metrics that matter. Unlike offline evaluation (benchmarks, test sets), A/B testing reveals how a change affects actual user behavior: engagement, satisfaction, task completion, and revenue.

Why It Matters

Offline metrics do not always predict real-world performance. A model that scores higher on a benchmark may produce responses users like less. A prompt change that improves quality may add enough latency that users abandon the task. A/B testing is the only way to know whether a change actually improves the user experience, and it is how every major AI product makes deployment decisions.

Deep Dive

The setup: route 50% of users to variant A (current system) and 50% to variant B (proposed change). Collect metrics for both: response quality ratings, task completion rates, user retention, time-on-task, and business metrics (conversion, revenue). Run until you have statistical significance (typically 95% confidence). If B wins, roll it out to 100%. If A wins, discard B.
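As a concrete illustration, here is a minimal Python sketch of the two pieces this setup needs: sticky 50/50 assignment and a significance check on a binary metric such as task completion. The experiment salt, the z-test helper, and the counts at the end are illustrative assumptions, not any specific product's tooling.

```python
# Minimal sketch of 50/50 A/B assignment plus a significance check on a
# binary metric (e.g. task completion). EXPERIMENT_SALT and the example
# counts below are illustrative assumptions.
import hashlib
import math

EXPERIMENT_SALT = "ab-test-prompt-v2"  # hypothetical experiment id

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user into A or B (sticky across sessions)."""
    digest = hashlib.sha256(f"{EXPERIMENT_SALT}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> float:
    """Return the z statistic for the difference in completion rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up counts: variant B completes tasks more often than variant A.
z = two_proportion_z_test(successes_a=470, n_a=1000, successes_b=530, n_b=1000)
print(f"z = {z:.2f}  (|z| > 1.96 is significant at 95% confidence)")
```

Hashing the user ID with an experiment-specific salt keeps assignment sticky across sessions and independent of any other experiment that uses a different salt.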

AI-Specific Challenges

A/B testing AI systems has unique challenges. Response quality is subjective and hard to measure automatically. Users might rate responses differently based on mood, not quality. The same prompt can produce different responses (non-deterministic), adding noise. Carry-over effects: users who had a bad experience with variant A might rate everything lower afterwards. Careful experiment design and sufficient sample sizes are essential.
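To make the sample-size point concrete, the sketch below estimates how many users per variant a two-proportion test needs to detect a given lift. The baseline rate and minimum detectable lift are illustrative assumptions; noisier, more subjective quality metrics push the numbers even higher.

```python
# Rough sample-size estimate for a two-proportion test, illustrating why
# small effects on noisy metrics require large samples. The inputs at the
# bottom are illustrative assumptions only.
import math

def required_sample_size(p_baseline: float, min_detectable_lift: float) -> int:
    """Approximate users needed per variant (alpha = 0.05 two-sided, power = 0.80)."""
    z_alpha = 1.96   # two-sided significance threshold at alpha = 0.05
    z_beta = 0.84    # corresponds to 80% power
    p1 = p_baseline
    p2 = p_baseline + min_detectable_lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (min_detectable_lift ** 2)
    return math.ceil(n)

# Detecting a 2-point lift on a 60% task-completion rate needs roughly
# 9,300 users per arm; halving the detectable lift roughly quadruples it.
print(required_sample_size(p_baseline=0.60, min_detectable_lift=0.02))
```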

Shadow Mode

Before A/B testing with real users, many teams use shadow mode: run the new model alongside the current one, but only show users the current model's responses. Log both responses and compare quality offline (via LLM-as-judge or human review). This catches obvious regressions before any user is affected. Only after shadow mode validation does the new model graduate to a real A/B test.
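A minimal sketch of what shadow-mode serving might look like, assuming an async Python service; call_current_model, call_candidate_model, and log_shadow_pair are hypothetical stand-ins for real serving and logging code.

```python
# Shadow-mode sketch: users always receive the current model's answer, while
# the candidate model's answer is produced in parallel and logged for offline
# comparison. All model and logging functions are hypothetical placeholders.
import asyncio

async def call_current_model(prompt: str) -> str:
    return f"current answer to: {prompt}"      # placeholder

async def call_candidate_model(prompt: str) -> str:
    return f"candidate answer to: {prompt}"    # placeholder

async def log_shadow_pair(prompt: str, current: str, candidate: str) -> None:
    print({"prompt": prompt, "current": current, "candidate": candidate})

async def handle_request(prompt: str) -> str:
    # Start both models; only the current model is on the user-facing path.
    current_task = asyncio.create_task(call_current_model(prompt))
    candidate_task = asyncio.create_task(call_candidate_model(prompt))
    current_response = await current_task

    async def shadow_log() -> None:
        try:
            candidate_response = await candidate_task
            await log_shadow_pair(prompt, current_response, candidate_response)
        except Exception:
            pass  # a shadow failure must never affect the user

    asyncio.create_task(shadow_log())  # fire-and-forget; adds no user latency
    return current_response

async def main() -> None:
    print(await handle_request("How do I reset my password?"))
    await asyncio.sleep(0.1)  # give the fire-and-forget shadow log time to finish

if __name__ == "__main__":
    asyncio.run(main())
```

The key design choice is that the candidate path is fully isolated: its latency and failures stay off the user-facing response, which is what makes shadow mode safe to run before any real A/B test.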

Related Concepts
