
Speech Recognition

STT, Speech-to-Text, ASR
Converts spoken audio into text. Modern speech recognition uses deep learning models (notably OpenAI's Whisper) to transcribe audio in 100+ languages with near-human accuracy. The technology powers voice assistants, meeting transcription, caption generation, and accessibility tools.

Why It Matters

Speech recognition unlocks voice as an input modality for AI. Combined with an LLM and text-to-speech, it enables fully voice-driven AI interaction. Whisper's open-source release democratized high-quality transcription: you can run it locally for free. For accessibility it is transformative, making audio content searchable, translatable, and usable by deaf and hard-of-hearing users.

Deep Dive

Whisper (OpenAI, 2022) is the dominant open speech recognition model. It's an encoder-decoder Transformer trained on 680,000 hours of multilingual audio-text pairs scraped from the web. The encoder processes audio spectrograms (visual representations of sound frequencies), and the decoder generates text tokens. Whisper handles multiple tasks: transcription, translation (audio in French → text in English), and language identification.
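The encoder's spectrogram input can be made concrete. Below is a minimal NumPy sketch of computing a log-mel spectrogram, using parameters in the ballpark of Whisper's front end (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins); the actual preprocessing differs in details such as padding and normalization, so treat this as illustrative only:

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Short-time Fourier transform: slide a Hann window over the signal
    # and take the power spectrum of each frame.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frame = audio[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)

    # Triangular mel filterbank: maps linear frequency bins onto the
    # mel scale, which spaces bins the way human hearing does.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, center, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, center):
            fbank[i, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):
            fbank[i, k] = (hi - k) / max(hi - center, 1)

    mel = fbank @ power
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone as a stand-in for real speech.
t = np.linspace(0, 1, 16000, endpoint=False)
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # 80 mel bins x frames
```

The resulting 2-D array is what "audio as an image" means here: the Transformer encoder attends over these time-frequency patches much as a vision model attends over image patches.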

The Accuracy Leap

Pre-Whisper, high-quality transcription required expensive commercial APIs or domain-specific models. Whisper matched commercial services at zero cost (the model is open-source). Its multilingual capability is particularly strong — it handles code-switching (mixing languages mid-sentence), accents, and background noise far better than previous open models. The larger Whisper variants (large-v3) approach human-level accuracy for clean audio.

Real-Time vs. Batch

Whisper was designed for batch processing (transcribe a complete audio file), not real-time streaming. Real-time applications require chunking audio into segments and transcribing them incrementally, which adds complexity around word boundaries and context. Specialized models and services (Deepgram, AssemblyAI) offer real-time streaming APIs. The choice depends on your latency requirements: batch for podcast transcription, streaming for live captioning.
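The chunking approach described above can be sketched with a simple overlapping splitter (a hypothetical helper, not part of Whisper's API; the 5-second chunk and 1-second overlap are illustrative values):

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_s=5.0, overlap_s=1.0):
    """Split audio into overlapping chunks for incremental transcription.

    The overlap means a word cut off at a chunk boundary reappears whole
    in the next chunk; the consumer must deduplicate the overlapping text.
    Returns a list of (start_time_seconds, samples) pairs.
    """
    size = int(chunk_s * sr)
    step = size - int(overlap_s * sr)
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append((start / sr, audio[start:start + size]))
        if start + size >= len(audio):
            break
    return chunks

# Twelve seconds of silence as a stand-in for a live microphone buffer.
audio = np.zeros(16000 * 12)
for t0, chunk in chunk_audio(audio):
    # In a real pipeline each chunk would be sent to the model here
    # (e.g. model.transcribe(chunk)) and the overlap merged.
    print(f"{t0:5.1f}s  {len(chunk)} samples")
```

Larger chunks give the model more context and fewer boundary errors but raise latency, which is exactly the batch-vs-streaming trade-off: live captioning wants small chunks, podcast transcription wants the whole file.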
