
Speech Recognition

STT, Speech-to-Text, ASR
Converting spoken audio to text. Modern speech recognition uses deep learning models (notably OpenAI's Whisper) that can transcribe audio in 100+ languages with near-human accuracy. The technology powers voice assistants, meeting transcription, subtitle generation, and accessibility tools.

Why It Matters

Speech recognition unlocked voice as an input modality for AI. Combined with LLMs and text-to-speech, it enables fully voice-driven AI interactions. Whisper's open release democratized high-quality transcription: you can run it locally for free. For accessibility it is transformative, making audio content searchable, translatable, and available to deaf and hard-of-hearing users.

Deep Dive

Whisper (OpenAI, 2022) is the dominant open speech recognition model. It's an encoder-decoder Transformer trained on 680,000 hours of multilingual audio-text pairs scraped from the web. The encoder processes audio spectrograms (visual representations of sound frequencies), and the decoder generates text tokens. Whisper handles multiple tasks: transcription, translation (audio in French → text in English), and language identification.
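The spectrogram input mentioned above can be illustrated with a minimal sketch. This is a plain STFT log-magnitude spectrogram built with NumPy only; Whisper's actual front end is an 80-channel log-mel spectrogram of 16 kHz audio (25 ms windows, 10 ms hop), so the parameters below mirror that setup but skip the mel filterbank for brevity.

```python
import numpy as np

def log_spectrogram(audio, n_fft=400, hop=160):
    """Naive log-magnitude spectrogram: slide a Hann window over the
    signal, FFT each frame, take log magnitudes. Whisper feeds an
    80-channel log-mel variant of this representation to its encoder."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frame = audio[start:start + n_fft] * window
        spectrum = np.abs(np.fft.rfft(frame))
        frames.append(np.log(spectrum + 1e-10))
    return np.stack(frames)  # shape: (time_frames, n_fft // 2 + 1)

# One second of a 440 Hz tone at 16 kHz (Whisper's expected sample rate)
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)
spec = log_spectrogram(audio)
print(spec.shape)  # 98 frames x 201 frequency bins
```

With a 400-sample FFT at 16 kHz, each frequency bin spans 40 Hz, so the 440 Hz tone shows up as a bright stripe at bin 11; the encoder learns to read phonetic content out of exactly this kind of time-frequency picture.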

The Accuracy Leap

Pre-Whisper, high-quality transcription required expensive commercial APIs or domain-specific models. Whisper matched commercial services at zero cost (the model is open-source). Its multilingual capability is particularly strong — it handles code-switching (mixing languages mid-sentence), accents, and background noise far better than previous open models. The larger Whisper variants (large-v3) approach human-level accuracy for clean audio.

Real-Time vs. Batch

Whisper was designed for batch processing (transcribe a complete audio file), not real-time streaming. Real-time applications require chunking audio into segments and transcribing them incrementally, which adds complexity around word boundaries and context. Specialized models and services (Deepgram, AssemblyAI) offer real-time streaming APIs. The choice depends on your latency requirements: batch for podcast transcription, streaming for live captioning.
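The chunking step described above can be sketched as pure index arithmetic. This is an illustrative helper, not part of any real streaming API: the chunk and overlap sizes are made-up parameters (Whisper itself operates on fixed 30-second windows), and merging the overlapping transcripts at word boundaries is left to the caller.

```python
def chunk_spans(n_samples, sr=16000, chunk_s=30.0, overlap_s=5.0):
    """Split an audio stream of n_samples into overlapping chunks for
    incremental transcription. Returns (start, end) sample indices.
    The overlap gives the model context across chunk boundaries, which
    is where word-splitting errors occur in naive streaming setups."""
    chunk = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    spans = []
    start = 0
    while start < n_samples:
        spans.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += step
    return spans

# 70 seconds of audio at 16 kHz -> three 30 s chunks with 5 s overlap
spans = chunk_spans(70 * 16000)
print(spans)
```

Each chunk would be transcribed as it arrives, and the 5-second overlaps reconciled (for example by word timestamps) before emitting final text; this is the complexity that dedicated streaming services handle for you.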
