Company

Deepgram

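Also known as: Nova speech-to-text, Aura text-to-speech
A speech AI company that builds fast, accurate speech recognition and text-to-speech APIs. Their Nova models compete with OpenAI's Whisper on accuracy, often beating it, while running significantly faster in real-time applications.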

Why It Matters

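Deepgram showed that a startup could build speech recognition from scratch with end-to-end deep learning, compete head-on with Google, Amazon, and Microsoft on accuracy, and beat them on speed. Their developer-first API approach brought modern infrastructure patterns to voice AI, so that adding transcription to an app is as simple as adding payments with Stripe. As conversational AI agents go mainstream, Deepgram is positioning itself as the critical speech infrastructure layer underneath: the plumbing that makes voice-first AI actually work in production.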

Deep Dive

Deepgram was founded in 2015 by Scott Stephenson, Noah Shutty, and Adam Sypniewski, three physicists who had been working on dark matter detection at the University of Michigan. The connection between particle physics and speech recognition is less weird than it sounds — both involve extracting faint signals from enormous amounts of noisy data. Stephenson saw an opportunity to apply end-to-end deep learning to speech recognition at a time when most commercial systems still relied on older hybrid architectures with hand-tuned acoustic models and language models stitched together. The company went through Y Combinator in 2016, then spent years in relative obscurity, building their technology and landing enterprise contracts. By 2022, they had raised over $85 million, including a $72 million Series B led by Tiger Global, and were processing billions of minutes of audio annually.

The Technical Bet

Deepgram built their speech recognition from scratch using end-to-end deep learning, rather than building on top of existing open-source models. This gave them control over the entire pipeline and let them optimize for what enterprise customers actually care about: speed, accuracy on domain-specific vocabulary, speaker diarization, and the ability to fine-tune models on a customer's own data. Their Nova model family, which launched in 2023 and has since iterated through Nova-2 and Nova-3, has consistently placed at or near the top of published accuracy benchmarks while maintaining some of the lowest latency in the industry. Nova-3 in particular became known for its performance on real-world audio such as phone calls, meetings, and noisy environments, where academic benchmarks often fail to predict real performance. They also built Aura, a text-to-speech system, positioning themselves as a full-stack voice AI platform.
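Accuracy comparisons like these are usually reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn a model's transcript into a human reference transcript, divided by the number of words in the reference. Lower is better. The Python sketch below computes WER with a word-level edit distance. The function name and sample sentences are illustrative and not taken from Deepgram's or anyone else's benchmark suite; real benchmarks also normalize punctuation, numbers, and casing, which this sketch reduces to lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference,
    computed as a word-level Levenshtein edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution (cost 1) or match (cost 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    ref = "please transfer me to the billing department"
    hyp = "please transfer me to billing department now"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Here one dropped word ("the") and one extra word ("now") give 2 errors over 7 reference words, a WER of about 0.286.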

Developer-First Strategy

Where older speech companies like Nuance sold to enterprises through long sales cycles and custom integrations, Deepgram went after developers first. Their API is clean, their documentation is good, and pricing is transparent and usage-based — pay per audio minute, no minimums, no contracts required. This approach let them build a large community of developers who tried Deepgram for side projects and then brought it into their companies. The strategy mirrors what Twilio did for communications and what Stripe did for payments: make the developer experience so good that bottom-up adoption does your sales work for you. They also offer on-premises deployment for customers with strict data sovereignty requirements, which matters a lot in healthcare, finance, and government.
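To give a sense of what this developer-facing, pay-per-minute API looks like in practice, here is a minimal sketch in standard-library Python. It assumes Deepgram's pre-recorded transcription endpoint at `https://api.deepgram.com/v1/listen`, token authentication through an `Authorization: Token <key>` header, a JSON body carrying the hosted audio `url`, and the `model`, `smart_format`, and `diarize` query parameters; check these against Deepgram's current API reference before relying on them. The helper names `build_transcription_request` and `extract_transcript` are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed URL of Deepgram's pre-recorded transcription endpoint.
LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_transcription_request(api_key: str, audio_url: str,
                                model: str = "nova-3",
                                diarize: bool = False) -> urllib.request.Request:
    """Build (but do not send) a request to transcribe a hosted audio file."""
    params = {"model": model, "smart_format": "true"}
    if diarize:
        params["diarize"] = "true"  # ask for per-speaker labels
    url = f"{LISTEN_URL}?{urllib.parse.urlencode(params)}"
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract_transcript(response_json: dict) -> str:
    """Return the top transcript from a decoded response (assumed response shape)."""
    return response_json["results"]["channels"][0]["alternatives"][0]["transcript"]


if __name__ == "__main__":
    req = build_transcription_request("YOUR_API_KEY",
                                      "https://example.com/call.wav",
                                      diarize=True)
    print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` and decoding the JSON body would yield a dictionary that `extract_transcript` reads, assuming transcripts are nested under `results.channels[].alternatives[]`.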

Competing with Giants and Open Source

Deepgram operates in one of the most competitive corners of AI. Google, Amazon, Microsoft, and IBM all offer speech-to-text APIs backed by massive R&D budgets. OpenAI's Whisper, released as open source in 2022, gave every developer free access to a good-enough transcription model. Against this, Deepgram competes on speed, accuracy, customization, and the overall developer experience. Their real-time streaming transcription is consistently faster than that of the big cloud providers, and their ability to train custom models on specific domains (medical terminology, legal jargon, brand names) gives them an edge for enterprise use cases where generic models struggle. The open-source threat is real but somewhat overstated: running Whisper at scale with low latency, high availability, and enterprise features is harder than it looks, and most companies would rather pay for a managed service.

The Voice AI Platform Play

Deepgram has been steadily expanding from pure transcription into a broader voice AI platform. With the addition of text-to-speech (Aura), voice agents, and audio intelligence features like sentiment analysis and topic detection, they are positioning themselves as the infrastructure layer for conversational AI. The timing is deliberate — as AI agents that can hold real phone conversations become viable, someone needs to provide the fast, accurate speech pipeline underneath, and Deepgram wants to be that provider. Their $47 million in additional funding raised in 2024 was partly aimed at this expansion, bringing total funding to over $130 million.
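The same request pattern extends to the broader platform features. The sketch below, again standard-library Python that builds requests without sending them, assumes that audio intelligence is enabled with `sentiment=true` and `topics=true` query parameters on `/v1/listen`, and that Aura text-to-speech is served from `/v1/speak` with a JSON `text` body and a voice selected through the `model` parameter. The voice name `aura-asteria-en` and the helper names `analyze_call` and `speak` are assumptions for illustration, not confirmed details of Deepgram's API.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.deepgram.com/v1"  # assumed base URL


def _post_json(path: str, params: dict, payload: dict,
               api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST to an assumed endpoint."""
    url = f"{API_BASE}/{path}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Token {api_key}",
                 "Content-Type": "application/json"},
    )


def analyze_call(api_key: str, audio_url: str) -> urllib.request.Request:
    """Hypothetical helper: transcribe a recording and request sentiment
    analysis and topic detection in the same pass."""
    params = {"sentiment": "true", "topics": "true"}
    return _post_json("listen", params, {"url": audio_url}, api_key)


def speak(api_key: str, text: str,
          voice: str = "aura-asteria-en") -> urllib.request.Request:
    """Hypothetical helper: ask Aura to synthesize `text`.
    The response body would be audio bytes rather than JSON."""
    return _post_json("speak", {"model": voice}, {"text": text}, api_key)


if __name__ == "__main__":
    print(analyze_call("YOUR_API_KEY", "https://example.com/call.wav").full_url)
    print(speak("YOUR_API_KEY", "Thanks for calling. How can I help?").full_url)
```

In a voice agent these calls would sit on either side of a language model: transcribe the caller, generate a reply, then synthesize it with Aura. A real-time agent would more likely use Deepgram's streaming (WebSocket) interfaces than one-shot HTTP requests like these.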
