
Speech Recognition

STT, Speech-to-Text, ASR
Converts spoken audio into text. Modern speech recognition uses deep learning models (notably OpenAI's Whisper) to transcribe audio in 100+ languages at near-human accuracy. The technology powers voice assistants, meeting transcription, caption generation, and accessibility tools.

Why It Matters

Speech recognition unlocks voice as an input modality for AI. Combined with an LLM and text-to-speech, it enables fully voice-driven AI interaction. Whisper's open-source release democratized high-quality transcription: you can run it locally for free. For accessibility it is transformative, making audio content searchable, translatable, and usable for deaf and hard-of-hearing users.

Deep Dive

Whisper (OpenAI, 2022) is the dominant open speech recognition model. It's an encoder-decoder Transformer trained on 680,000 hours of multilingual audio-text pairs scraped from the web. The encoder processes audio spectrograms (visual representations of sound frequencies), and the decoder generates text tokens. Whisper handles multiple tasks: transcription, translation (audio in French → text in English), and language identification.
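The encoder's spectrogram input can be illustrated with a minimal sketch: frame the waveform into short overlapping windows and take FFT magnitudes. This is only the first stage of Whisper's actual front end (which further maps magnitudes onto 80 mel bands and takes the log); the window and hop sizes below match Whisper's published 25 ms / 10 ms framing at 16 kHz, but the function itself is illustrative, not Whisper's code.

```python
import numpy as np

def spectrogram(audio, n_fft=400, hop=160):
    """Frame audio into overlapping windows and take FFT magnitudes.

    Whisper uses 25 ms windows with a 10 ms hop at 16 kHz (400/160
    samples), then maps magnitudes onto 80 mel bands; the mel + log
    steps are omitted here for brevity.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frames.append(np.abs(np.fft.rfft(audio[start:start + n_fft] * window)))
    return np.array(frames)  # shape: (num_frames, n_fft // 2 + 1)

# One second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): 98 frames, 201 frequency bins
```

With 400-sample frames the FFT bins are 40 Hz apart, so the 440 Hz tone peaks in bin 11; the decoder then attends over (mel-compressed versions of) these frames to emit text tokens.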

The Accuracy Leap

Pre-Whisper, high-quality transcription required expensive commercial APIs or domain-specific models. Whisper matched commercial services at zero cost (the model is open-source). Its multilingual capability is particularly strong — it handles code-switching (mixing languages mid-sentence), accents, and background noise far better than previous open models. The larger Whisper variants (large-v3) approach human-level accuracy for clean audio.

Real-Time vs. Batch

Whisper was designed for batch processing (transcribe a complete audio file), not real-time streaming. Real-time applications require chunking audio into segments and transcribing them incrementally, which adds complexity around word boundaries and context. Specialized models and services (Deepgram, AssemblyAI) offer real-time streaming APIs. The choice depends on your latency requirements: batch for podcast transcription, streaming for live captioning.
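The chunking approach described above can be sketched in a few lines. This is a simplified illustration, not a production streaming pipeline: `transcribe_chunk` is a placeholder for any batch ASR call (e.g. a Whisper invocation), and the overlap between chunks keeps words that straddle a boundary inside at least one chunk.

```python
# Sketch of incremental transcription by chunking a long audio stream.
# Assumed names: `chunk_stream`, `transcribe_stream`, and the stub
# recognizer are illustrative, not part of any real API.

def chunk_stream(samples, sr=16000, chunk_s=5.0, overlap_s=0.5):
    """Yield overlapping fixed-length chunks of an audio buffer."""
    step = int((chunk_s - overlap_s) * sr)   # advance 4.5 s per chunk
    size = int(chunk_s * sr)                 # each chunk covers 5.0 s
    for start in range(0, len(samples), step):
        yield samples[start:start + size]

def transcribe_stream(samples, transcribe_chunk):
    """Transcribe chunks incrementally and join the partial texts.

    Real systems also deduplicate text in the overlap region and carry
    decoding context between chunks; both are skipped here.
    """
    parts = (transcribe_chunk(c) for c in chunk_stream(samples))
    return " ".join(p for p in parts if p)

# Usage with a stub recognizer that just reports each chunk's duration:
audio = [0.0] * (16000 * 12)  # 12 s of silence
stub = lambda chunk: f"[{len(chunk) / 16000:.1f}s]"
print(transcribe_stream(audio, stub))  # [5.0s] [5.0s] [3.0s]
```

The overlap is the key design choice: without it, a word split across a boundary is mangled in both chunks; with it, the word appears whole in one chunk, at the cost of transcribing the overlap region twice.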
