Cohere released Transcribe, an automatic speech recognition model that achieves a 5.42% average word error rate across standard benchmarks, claiming the top spot on Hugging Face's Open ASR Leaderboard. The model uses a Conformer encoder paired with a lightweight Transformer decoder, supporting 14 languages including English, Chinese, Japanese, and Arabic. In head-to-head human evaluations, annotators preferred Cohere's transcripts 78% of the time against IBM Granite and 64% against OpenAI's Whisper Large v3.

This represents Cohere's first major push beyond text generation into speech processing, a strategic move as enterprises increasingly need to process audio data at scale. The Conformer architecture makes sense here—combining CNNs for local acoustic features with Transformers for global context addresses real ASR challenges better than pure attention mechanisms. However, the model's constraint to 35-second audio chunks for long-form content exposes the memory limitations that still plague production speech systems.

What's notable is Cohere's "quality over quantity" approach with just 14 languages, directly competing against Whisper's 100+ language support. The benchmarks look impressive, but enterprise ASR lives in the messy reality of accented speech, background noise, and domain-specific jargon that standard test sets don't capture. The human preference metrics are more telling—real users can distinguish quality differences that WER scores miss.
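To see why WER can miss quality differences, here is a minimal sketch of how word error rate is conventionally computed (word-level Levenshtein edit distance divided by reference length; this is the standard definition, not Cohere's evaluation code). Two transcripts can score identically while differing sharply in how badly the error distorts meaning:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the patient was prescribed ten milligrams"
# Both hypotheses have exactly one substitution, so identical WER (1/6),
# but the second one corrupts the meaning far more than the first.
print(wer(ref, "the patient was prescribed ten milligram"))   # → 0.1666...
print(wer(ref, "the patient saw prescribed ten milligrams"))  # → 0.1666...
```

Human annotators distinguish these cases instantly; an aggregate WER score cannot, which is why the preference evaluations are the more informative signal.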

For developers building speech applications, this gives you another strong option beyond OpenAI and ElevenLabs, especially if you need self-hosted deployment. The 35-second chunking limitation means you'll still need preprocessing pipelines for long audio, but the accuracy gains might justify the engineering overhead. Worth testing on your actual data—benchmarks rarely survive contact with production audio.
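The preprocessing that the 35-second limit implies usually looks like fixed-length windowing with a small overlap so words at chunk boundaries aren't clipped. A minimal sketch, assuming 16 kHz mono PCM; the overlap length is illustrative and nothing here is Cohere's actual API:

```python
from typing import Iterator

SAMPLE_RATE = 16_000   # assumed input rate, samples per second
MAX_CHUNK_SEC = 35     # the model's per-request limit
OVERLAP_SEC = 2        # illustrative overlap to avoid clipping words at boundaries

def chunk_audio(samples: list[float],
                chunk_sec: int = MAX_CHUNK_SEC,
                overlap_sec: int = OVERLAP_SEC) -> Iterator[list[float]]:
    """Yield overlapping fixed-length windows over a mono PCM buffer."""
    step = (chunk_sec - overlap_sec) * SAMPLE_RATE   # advance per window
    size = chunk_sec * SAMPLE_RATE                   # samples per window
    for start in range(0, max(len(samples) - overlap_sec * SAMPLE_RATE, 1), step):
        yield samples[start:start + size]

# 90 seconds of audio -> three windows: 35 s, 35 s, and a 24 s remainder,
# each sharing 2 s with its neighbor.
audio = [0.0] * (90 * SAMPLE_RATE)
chunks = list(chunk_audio(audio))
print(len(chunks), [len(c) / SAMPLE_RATE for c in chunks])  # → 3 [35.0, 35.0, 24.0]
```

The chunking itself is the easy half; stitching the per-chunk transcripts back together — deduplicating the words that fall inside the overlap — is where most of the engineering overhead actually lands.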