Whisper (OpenAI, 2022) is the dominant open speech recognition model. It's an encoder-decoder Transformer trained on 680,000 hours of multilingual audio-text pairs scraped from the web. The encoder processes audio spectrograms (visual representations of sound frequencies over time), and the decoder generates text tokens. Whisper handles multiple tasks: transcription, translation into English (e.g. audio in French → text in English), and language identification.
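As a rough sketch of what the encoder actually sees, the snippet below computes a log magnitude spectrogram from a waveform in plain NumPy. Whisper itself uses an 80-channel log-mel spectrogram at 16 kHz with 25 ms windows and a 10 ms stride; the mel filterbank is omitted here for brevity, so treat this as an illustration of the representation, not Whisper's exact preprocessing.

```python
import numpy as np

def log_spectrogram(audio, n_fft=400, hop=160):
    """Frame the waveform and take the log magnitude of each frame's FFT.

    n_fft=400 / hop=160 correspond to 25 ms windows with a 10 ms stride
    at 16 kHz, the frame sizes Whisper uses. Whisper additionally maps
    the FFT bins onto 80 mel channels, which is skipped here.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log10(np.maximum(magnitudes, 1e-10))

# One second of a 440 Hz tone sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (98, 201): 98 frames, n_fft // 2 + 1 frequency bins
```

Each row is one 25 ms slice of audio and each column one frequency band; the 440 Hz tone shows up as a bright stripe in a single band, which is the kind of time-frequency pattern the encoder's attention layers operate on.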
Pre-Whisper, high-quality transcription required expensive commercial APIs or domain-specific models. Whisper matched commercial services without per-minute API fees (the model weights are open-source and can be run locally). Its multilingual capability is particularly strong — it handles code-switching (mixing languages mid-sentence), accents, and background noise far better than previous open models. The largest Whisper variants (large-v3) approach human-level accuracy on clean audio.
Whisper was designed for batch processing (transcribe a complete audio file), not real-time streaming. Real-time applications require chunking audio into segments and transcribing them incrementally, which adds complexity around word boundaries and context. Specialized models and services (Deepgram, AssemblyAI) offer real-time streaming APIs. The choice depends on your latency requirements: batch for podcast transcription, streaming for live captioning.
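A minimal sketch of the chunking step described above, assuming 16 kHz samples in any sliceable sequence. Whisper processes 30-second windows, so that is the default chunk size; the small overlap gives a word that straddles a boundary a chance to appear intact in at least one chunk. Deduplicating text in the overlapped region (and carrying decoder context between chunks) is the genuinely hard part and is left to the caller here.

```python
def chunk_audio(samples, sr=16_000, chunk_s=30.0, overlap_s=1.0):
    """Split a waveform into fixed-size chunks with a small overlap.

    chunk_s defaults to 30 because Whisper operates on 30-second
    windows. Consecutive chunks share overlap_s seconds of audio so
    boundary words are not cut in half in every chunk; the caller is
    responsible for merging the overlapping transcripts.
    """
    chunk = int(chunk_s * sr)          # samples per chunk
    step = int((chunk_s - overlap_s) * sr)  # stride between chunk starts
    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start : start + chunk])
        start += step
    return chunks
```

Each chunk would then be fed to the model in turn (e.g. a wrapper around Whisper's transcribe call), with the incremental results stitched together for live display.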