Company

Deepgram

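Also known as: Nova speech-to-text, Aura text-to-speech
A speech AI company that builds fast, accurate speech recognition and text-to-speech APIs. Its Nova models compete with OpenAI's Whisper on accuracy, often beating it, while running significantly faster in real-time applications.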

Why It Matters

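Deepgram showed that a startup could build speech recognition from scratch with end-to-end deep learning, compete head-on with Google, Amazon, and Microsoft on accuracy, and beat them on speed. Its developer-first API approach brought modern infrastructure patterns to voice AI, making it as simple to add transcription to an app as it is to add payments with Stripe. As conversational AI agents go mainstream, Deepgram is positioning itself as the critical voice infrastructure layer underneath: the plumbing that makes voice-first AI actually work in production.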

Deep Dive

Deepgram was founded in 2015 by Scott Stephenson, Noah Shutty, and Adam Sypniewski, three physicists who had been working on dark matter detection at the University of Michigan. The connection between particle physics and speech recognition is less weird than it sounds — both involve extracting faint signals from enormous amounts of noisy data. Stephenson saw an opportunity to apply end-to-end deep learning to speech recognition at a time when most commercial systems still relied on older hybrid architectures with hand-tuned acoustic models and language models stitched together. The company went through Y Combinator in 2016, then spent years in relative obscurity, building their technology and landing enterprise contracts. By 2022, they had raised over $85 million, including a $72 million Series B led by Tiger Global, and were processing billions of minutes of audio annually.

The Technical Bet

Deepgram built their speech recognition from scratch using end-to-end deep learning, rather than building on top of existing open-source models. This gave them control over the entire pipeline and let them optimize for things enterprise customers actually care about: speed, accuracy on domain-specific vocabulary, speaker diarization, and the ability to fine-tune models on a customer's own data. Their Nova model family, which launched in 2023 and iterated through Nova-2 and Nova-3, consistently topped accuracy benchmarks while maintaining some of the lowest latency in the industry. Nova-3 in particular became known for its performance on real-world audio — phone calls, meetings, noisy environments — where academic benchmarks often fail to predict real performance. They also built Aura, a text-to-speech system, positioning themselves as a full-stack voice AI platform.

Developer-First Strategy

Where older speech companies like Nuance sold to enterprises through long sales cycles and custom integrations, Deepgram went after developers first. Their API is clean, their documentation is good, and pricing is transparent and usage-based — pay per audio minute, no minimums, no contracts required. This approach let them build a large community of developers who tried Deepgram for side projects and then brought it into their companies. The strategy mirrors what Twilio did for communications and what Stripe did for payments: make the developer experience so good that bottom-up adoption does your sales work for you. They also offer on-premises deployment for customers with strict data sovereignty requirements, which matters a lot in healthcare, finance, and government.

Competing with Giants and Open Source

Deepgram operates in one of the most competitive corners of AI. Google, Amazon, Microsoft, and IBM all offer speech-to-text APIs backed by massive R&D budgets. OpenAI's Whisper, released as open source in 2022, gave every developer free access to a good-enough transcription model. Against this, Deepgram competes on speed, accuracy, customization, and the overall developer experience. Their real-time streaming transcription is consistently faster than the big cloud providers, and their ability to train custom models on specific domains — medical terminology, legal jargon, brand names — gives them an edge for enterprise use cases where generic models struggle. The open-source threat is real but somewhat overstated: running Whisper at scale with low latency, high availability, and enterprise features is harder than it looks, and most companies would rather pay for a managed service.

The Voice AI Platform Play

Deepgram has been steadily expanding from pure transcription into a broader voice AI platform. With the addition of text-to-speech (Aura), voice agents, and audio intelligence features like sentiment analysis and topic detection, they are positioning themselves as the infrastructure layer for conversational AI. The timing is deliberate — as AI agents that can hold real phone conversations become viable, someone needs to provide the fast, accurate speech pipeline underneath, and Deepgram wants to be that provider. Their $47 million in additional funding raised in 2024 was partly aimed at this expansion, bringing total funding to over $130 million.
