Speech Recognition

STT, Speech-to-Text, ASR
Converting spoken audio into text. Modern speech recognition uses deep learning models (most notably OpenAI's Whisper) that can transcribe audio in roughly 100 languages, with accuracy that approaches human transcribers on clean recordings in well-resourced languages. The technology powers voice assistants, meeting transcription, subtitle generation, and accessibility tools.

Why it matters

Speech recognition unlocked voice as an input modality for AI. Combined with LLMs and text-to-speech, it enables fully voice-driven AI interactions. Whisper's open release democratized high-quality transcription: you can run it locally for free. For accessibility, it is transformative, making audio content searchable, translatable, and available to deaf and hard-of-hearing users.

Deep Dive

Whisper (OpenAI, 2022) is the dominant open speech recognition model. It's an encoder-decoder Transformer trained on 680,000 hours of multilingual audio paired with transcripts collected from the web. The encoder processes log-Mel spectrograms (time-frequency representations of the audio), and the decoder generates text tokens. Whisper handles multiple tasks: transcription, translation into English (audio in French → text in English), and language identification.
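To make the spectrogram input concrete, here is a minimal sketch of the framing arithmetic. The constants (16 kHz audio, a 10 ms hop, 30-second windows, 80 mel bins) are assumed from the original Whisper release, and the function name `spectrogram_shape` is hypothetical. The sketch also assumes fixed, back-to-back windows, which is a simplification of how real implementations step through long audio.

```python
import math

# Assumed values from the original Whisper release; large-v3 uses 128 mel bins.
SAMPLE_RATE = 16_000   # samples per second
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
CHUNK_SECONDS = 30     # each encoder input covers a fixed 30-second window
N_MELS = 80            # mel frequency bins per frame


def spectrogram_shape(duration_seconds: float) -> tuple[int, int, int]:
    """Return (n_windows, n_mels, frames_per_window) for a clip.

    Short clips are treated as padded to one full window; long clips are
    split into fixed back-to-back windows (a simplification).
    """
    n_samples = int(duration_seconds * SAMPLE_RATE)
    window_samples = CHUNK_SECONDS * SAMPLE_RATE
    n_windows = max(1, math.ceil(n_samples / window_samples))
    frames_per_window = window_samples // HOP_LENGTH
    return n_windows, N_MELS, frames_per_window


if __name__ == "__main__":
    print(spectrogram_shape(30))   # one full window
    print(spectrogram_shape(95))   # a 95-second clip needs four windows
```

Under these assumptions, every encoder input has the same shape regardless of clip length, which is why short clips are padded and long clips are processed window by window.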

The Accuracy Leap

Pre-Whisper, high-quality transcription required expensive commercial APIs or domain-specific models. Whisper delivered quality comparable to commercial services with no licensing cost (the code and model weights are released under the MIT license). Its multilingual capability is particularly strong: it handles code-switching (mixing languages mid-sentence), accents, and background noise far better than previous open models. The largest Whisper variants (such as large-v3) approach human-level accuracy on clean audio.
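Accuracy comparisons like these are usually expressed as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The text above does not name a metric, so treat WER as an assumption here. A minimal sketch, with simple lowercasing and whitespace splitting standing in for the heavier text normalization that real benchmarks apply:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (word missing from output)
                curr[j - 1] + 1,                  # insertion (extra word in output)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Lower is better, and a WER of 0 means the output matches the reference word for word. Because insertions count as errors, WER can exceed 1.0 on very poor output.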

Real-Time vs. Batch

Whisper was designed for batch processing (transcribe a complete audio file), not real-time streaming. Real-time applications require chunking audio into segments and transcribing them incrementally, which adds complexity around word boundaries and context. Specialized models and services (Deepgram, AssemblyAI) offer real-time streaming APIs. The choice depends on your latency requirements: batch for podcast transcription, streaming for live captioning.
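The chunking step can be sketched as follows. This is a minimal illustration, not how any particular service implements streaming: the function name `chunk_spans` and the chunk and overlap sizes are hypothetical. Overlapping consecutive chunks is one common way to avoid cutting a word in half at a boundary, at the cost of having to merge duplicated words afterwards.

```python
def chunk_spans(total_samples: int, chunk_samples: int,
                overlap_samples: int) -> list[tuple[int, int]]:
    """Split an audio buffer into overlapping (start, end) sample ranges."""
    if chunk_samples <= 0:
        raise ValueError("chunk_samples must be positive")
    if not 0 <= overlap_samples < chunk_samples:
        raise ValueError("overlap must be non-negative and smaller than the chunk")

    step = chunk_samples - overlap_samples
    spans = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_samples, total_samples)
        spans.append((start, end))
        if end == total_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    sample_rate = 16_000  # assumed sampling rate
    # 12 seconds of audio, 5-second chunks, 1 second of overlap.
    for start, end in chunk_spans(12 * sample_rate, 5 * sample_rate, sample_rate):
        print(f"{start / sample_rate:.1f}s to {end / sample_rate:.1f}s")
```

Each span would be passed to the recognizer in turn. Smaller chunks lower latency but give the model less context, which is the trade-off described above.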
