Company

Twelve Labs

Also known as: Video search, Pegasus, Marengo
A video understanding company that lets you search, analyze, and generate content from video using natural language. Think of it as "RAG for video": their models understand what happens inside a video the way an LLM understands text.

Why It Matters

Twelve Labs is building the infrastructure that makes the world's video content machine-readable. In an era when video dominates digital communication yet most of it remains unsearchable by AI, their purpose-built embedding and generation models tackle a problem that even the largest frontier labs have only touched superficially. If video is the internet's dominant medium, then whoever solves video understanding at production scale occupies a strategic position comparable to what Google Search holds for text.

Deep Dive

Twelve Labs was founded in 2021 by Jae Lee and Aiden Lee, who saw a massive gap in the AI landscape: while text-based models were advancing at breakneck speed, video remained stubbornly opaque to machines. You could ask an LLM to summarize a document in seconds, but asking it what happened at minute 14:32 of a two-hour video? Impossible. The founding team, with roots in computer vision research and experience at companies like Google and Samsung, recognized that video understanding required a fundamentally different approach from bolting image recognition onto a timeline. They set out to build multimodal foundation models that understand video natively — treating visual scenes, audio, speech, and on-screen text as a unified stream rather than separate channels stitched together after the fact.

Pegasus and Marengo: The Product Stack

Twelve Labs' core products are Pegasus and Marengo, each tackling a different piece of the video intelligence problem. Marengo is their video embedding model — it converts video content into rich vector representations that enable semantic search across massive video libraries. You can query "person in a red jacket opening a door" across thousands of hours of footage and get precise timestamp-level results, even if no one ever tagged or captioned that moment. Pegasus is their video-to-text generation model, capable of summarizing, describing, and answering questions about video content with a specificity that generic vision-language models struggle to match. Together, these models power an API that lets developers build applications like media asset management, compliance monitoring, content moderation, and educational video search without needing to build their own video ML pipeline from scratch.
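The Marengo-style workflow described above (embed video segments into vectors, then rank them against an embedded text query) can be sketched with toy data. Everything below is illustrative: the three-dimensional vectors, the segment boundaries, and the function names are assumptions for demonstration, not Twelve Labs' actual SDK or embedding space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical per-segment embeddings: (start_sec, end_sec, vector).
# A real system would emit high-dimensional vectors per indexed clip.
segments = [
    (0, 10, [0.9, 0.1, 0.0]),   # e.g. "crowd in a stadium"
    (10, 20, [0.1, 0.8, 0.2]),  # e.g. "person in a red jacket at a door"
    (20, 30, [0.0, 0.2, 0.9]),  # e.g. "aerial shot of a city"
]

def search(query_vec, segments, top_k=1):
    """Return the top_k segments ranked by similarity to the query vector."""
    scored = [(cosine(query_vec, v), start, end) for start, end, v in segments]
    scored.sort(reverse=True)
    return scored[:top_k]

# Pretend this is the embedding of "person in a red jacket opening a door".
query = [0.1, 0.9, 0.1]
best = search(query, segments)[0]
print(best[1], best[2])  # the 10-20 second segment scores highest
```

The point of the sketch is the shape of the pipeline: because both queries and video segments live in the same vector space, search returns timestamp-level hits even for footage no one ever tagged or captioned.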

Funding and Market Position

The company raised a $50 million Series A in 2024 led by NEA and NVentures (NVIDIA's venture arm), with participation from Index Ventures and existing investors. This brought their total funding past $70 million. The NVIDIA investment was particularly significant — it signaled that the GPU maker saw video understanding as a distinct, high-value market segment worth betting on, not just a feature that would eventually get absorbed into general-purpose multimodal models from OpenAI or Google. Twelve Labs has been deliberate about positioning themselves as infrastructure, not an end-user application. Their API-first approach means they don't compete with their customers; they're the plumbing that makes video-native AI applications possible across industries from media and entertainment to security and healthcare.

The Video Understanding Gap

The reason Twelve Labs has space to exist in a market dominated by well-funded generalist labs is that video is genuinely hard. A single hour of video at 30 frames per second contains 108,000 images, plus audio, speech, text overlays, and temporal relationships between all of them. General-purpose multimodal models like GPT-4o and Gemini can process short video clips, but they struggle with the scale, precision, and speed that production video applications demand. Twelve Labs' purpose-built architecture is designed for exactly this problem: fast indexing of massive video libraries, sub-second search across hundreds of thousands of hours, and generation tasks that require understanding what happened over time, not just in a single frame. As video continues to dominate internet traffic and enterprise data — Cisco estimates video will represent 82% of all IP traffic — the companies that can make that content searchable and actionable will own a uniquely valuable piece of the AI stack.
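The scale claim above is simple arithmetic worth making concrete. The 100,000-hour library size below is an illustrative assumption, not a Twelve Labs figure:

```python
FPS = 30
SECONDS_PER_HOUR = 3600

# One hour of 30 fps video, as stated above.
frames_per_hour = FPS * SECONDS_PER_HOUR
print(frames_per_hour)  # 108000

# A hypothetical 100,000-hour library (assumption for illustration):
library_hours = 100_000
total_frames = frames_per_hour * library_hours
print(f"{total_frames:,}")  # 10,800,000,000 frames if indexed naively frame-by-frame
```

Ten billion frames for a single hypothetical library is why frame-by-frame image recognition does not scale, and why segment-level embeddings and temporal modeling are the practical unit of video indexing.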
