Using AI

Speech Recognition

STT, Speech-to-Text, ASR
Converting spoken audio into text. Modern speech recognition uses deep learning models (most notably OpenAI's Whisper) that can transcribe audio in roughly 100 languages with near-human accuracy. The technology powers voice assistants, meeting transcription, subtitle generation, and accessibility tools.

Why It Matters

Speech recognition unlocked voice as an input modality for AI. Combined with LLMs and text-to-speech, it enables fully voice-driven AI interactions. Whisper's open release democratized high-quality transcription: you can run it locally for free. For accessibility it is transformative, making audio content searchable, translatable, and available to deaf and hard-of-hearing users.

Deep Dive

Whisper (OpenAI, 2022) is the dominant open speech recognition model. It is an encoder-decoder Transformer trained on 680,000 hours of multilingual audio-text pairs collected from the web. The encoder processes log-Mel spectrograms (time-frequency representations of the audio), and the decoder generates text tokens. Whisper handles multiple tasks: transcription, translation into English (for example, French audio to English text), and language identification.
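To make the encoder's input concrete, here is a minimal sketch of the framing arithmetic only, not of the spectrogram computation itself. The constants (16 kHz audio, a 10 ms hop, 30-second windows, 80 Mel bands) are assumptions taken from the published Whisper description, and the helper names `pad_or_trim` and `spectrogram_shape` are illustrative rather than a reference to any particular library.

```python
SAMPLE_RATE = 16_000   # Hz; assumed input sampling rate
HOP_MS = 10            # assumed stride between spectrogram frames
CHUNK_SECONDS = 30     # assumed fixed window length fed to the encoder
N_MELS = 80            # assumed number of Mel frequency bands


def pad_or_trim(samples, length=SAMPLE_RATE * CHUNK_SECONDS):
    """Return exactly `length` samples: cut long audio, zero-pad short audio."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def spectrogram_shape(n_samples):
    """Shape (mel_bands, frames) of the spectrogram for `n_samples` of audio."""
    hop = SAMPLE_RATE * HOP_MS // 1000  # 160 samples per frame step
    return (N_MELS, n_samples // hop)


# A 12-second clip of silence is padded to the full 30-second window.
clip = [0.0] * (SAMPLE_RATE * 12)
window = pad_or_trim(clip)
print(spectrogram_shape(len(window)))  # (80, 3000)
```

Under these assumptions every input, short or long, reaches the encoder as the same fixed-size grid, which is one reason the model is a natural fit for whole-file processing.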

The Accuracy Leap

Before Whisper, high-quality transcription required expensive commercial APIs or domain-specific models. Whisper matched commercial services with no licensing cost: the code and model weights are released under the MIT license, so the only expense is the compute needed to run it. Its multilingual capability is particularly strong: it handles code-switching (mixing languages mid-sentence), accents, and background noise far better than previous open models. The largest Whisper variants (such as large-v3) approach human-level accuracy on clean audio.
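Accuracy claims like these are usually reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference length. The following is a minimal sketch of that metric, assuming whitespace tokenization and lowercasing only; real evaluations typically apply heavier text normalization, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("bat") and one insertion ("down") over 3 reference words.
print(word_error_rate("the cat sat", "the bat sat down"))
```

Lower is better, and because insertions count as errors, WER can exceed 1.0 for very poor transcripts.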

Real-Time vs. Batch

Whisper was designed for batch processing (transcribing a complete audio file), not real-time streaming. Real-time applications have to split the audio into segments and transcribe them incrementally, which adds complexity: a segment boundary can cut a word in half, and each segment loses the context of what came before. Specialized models and services (Deepgram, AssemblyAI) offer real-time streaming APIs. The choice depends on your latency requirements: batch for podcast transcription, streaming for live captioning.
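One common way to soften the word-boundary problem is to let consecutive segments overlap, so a word cut at the end of one segment appears whole in the next. The sketch below shows only the chunking step under that assumption; the function name `chunk_audio`, the 5-second chunk length, and the 1-second overlap are illustrative choices, and merging the overlapping transcripts afterwards is left out.

```python
def chunk_audio(samples, sample_rate=16_000, chunk_s=5.0, overlap_s=1.0):
    """Yield (start_time_seconds, chunk) pairs with overlapping chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    start = 0
    while start < len(samples):
        yield start / sample_rate, samples[start:start + chunk]
        if start + chunk >= len(samples):
            break  # this chunk already reached the end of the audio
        start += step


# 12 seconds of dummy audio at a toy rate of 10 samples per second.
audio = list(range(120))
for start_time, piece in chunk_audio(audio, sample_rate=10):
    print(start_time, len(piece))
```

Each chunk would then be passed to the recognizer as it arrives. A longer overlap gives more protection against split words at the cost of transcribing more audio twice.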

