Company

Twelve Labs

Also known as: Video search, Pegasus, Marengo

A video understanding company that lets you search, analyze, and generate content from video using natural language. Think of it as "RAG for video": their models understand what is happening in a video the way an LLM understands text.

Why It Matters

Twelve Labs is building the infrastructure to make the world's video content machine-readable. In an era where video dominates digital communication yet most of it remains unsearchable by AI, their purpose-built embedding and generation models tackle a problem that even the largest frontier labs have only scratched the surface of. If video is the web's dominant medium, whoever solves video understanding at production scale holds a strategic position comparable to what Google Search holds for text.

Deep Dive

Twelve Labs was founded in 2021 by Jae Lee and Aiden Lee, who saw a massive gap in the AI landscape: while text-based models were advancing at breakneck speed, video remained stubbornly opaque to machines. You could ask an LLM to summarize a document in seconds, but asking it what happened at minute 14:32 of a two-hour video? Impossible. The founding team, with roots in computer vision research and experience at companies like Google and Samsung, recognized that video understanding required a fundamentally different approach from bolting image recognition onto a timeline. They set out to build multimodal foundation models that understand video natively — treating visual scenes, audio, speech, and on-screen text as a unified stream rather than separate channels stitched together after the fact.

Pegasus and Marengo: The Product Stack

Twelve Labs' core products are Pegasus and Marengo, each tackling a different piece of the video intelligence problem. Marengo is their video embedding model — it converts video content into rich vector representations that enable semantic search across massive video libraries. You can query "person in a red jacket opening a door" across thousands of hours of footage and get precise timestamp-level results, even if no one ever tagged or captioned that moment. Pegasus is their video-to-text generation model, capable of summarizing, describing, and answering questions about video content with a specificity that generic vision-language models struggle to match. Together, these models power an API that lets developers build applications like media asset management, compliance monitoring, content moderation, and educational video search without needing to build their own video ML pipeline from scratch.
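To make the embedding-search idea concrete, here is a minimal conceptual sketch of timestamp-level semantic search over precomputed clip embeddings. It does not use Twelve Labs' actual API or SDK; the index layout, the clip length, and the random stand-in vectors are assumptions made purely for illustration.

```python
import numpy as np

# Conceptual sketch of timestamp-level semantic video search over precomputed
# clip embeddings. The embedding model is out of scope here: the random vectors
# below stand in for what a Marengo-style model would produce in a shared
# text/video embedding space.

EMBED_DIM = 512
rng = np.random.default_rng(0)

# Assumed index layout: one entry per five-second clip of a longer video,
# each carrying its time offsets and an embedding vector.
clip_index = [
    {
        "video_id": "warehouse_cam1",
        "start_s": i * 5.0,
        "end_s": (i + 1) * 5.0,
        "embedding": rng.normal(size=EMBED_DIM),
    }
    for i in range(720)  # one hour of footage in five-second clips
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, index, top_k=5):
    """Rank clips by similarity to the query and return the best timestamps."""
    scored = [(cosine(query_embedding, clip["embedding"]), clip) for clip in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"video_id": clip["video_id"], "start_s": clip["start_s"],
         "end_s": clip["end_s"], "score": round(score, 3)}
        for score, clip in scored[:top_k]
    ]

# In a real pipeline the query text ("person in a red jacket opening a door")
# would be embedded by the same model; a random vector stands in for it here.
query_embedding = rng.normal(size=EMBED_DIM)
for hit in search(query_embedding, clip_index):
    print(hit)
```

A production system would swap the placeholder vectors for real model outputs and the linear scan for an approximate-nearest-neighbor index, but the shape of the problem stays the same: text query in, ranked timestamps out.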

Funding and Market Position

The company raised a $50 million Series A in 2024 led by NEA and NVentures (NVIDIA's venture arm), with participation from Index Ventures and existing investors. This brought their total funding past $70 million. The NVIDIA investment was particularly significant — it signaled that the GPU maker saw video understanding as a distinct, high-value market segment worth betting on, not just a feature that would eventually get absorbed into general-purpose multimodal models from OpenAI or Google. Twelve Labs has been deliberate about positioning themselves as infrastructure, not an end-user application. Their API-first approach means they don't compete with their customers; they're the plumbing that makes video-native AI applications possible across industries from media and entertainment to security and healthcare.

The Video Understanding Gap

The reason Twelve Labs has space to exist in a market dominated by well-funded generalist labs is that video is genuinely hard. A single hour of video at 30 frames per second contains 108,000 frames, plus audio, speech, text overlays, and temporal relationships between all of them. General-purpose multimodal models like GPT-4o and Gemini can process short video clips, but they struggle with the scale, precision, and speed that production video applications demand. Twelve Labs' purpose-built architecture is designed for exactly this problem: fast indexing of massive video libraries, sub-second search across hundreds of thousands of hours, and generation tasks that require understanding what happened over time, not just in a single frame. As video continues to dominate internet traffic and enterprise data (Cisco's forecasts have put video at roughly 82% of all IP traffic), the companies that can make that content searchable and actionable will own a uniquely valuable piece of the AI stack.
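Those scale claims are easy to put numbers on. The sketch below runs the back-of-the-envelope arithmetic; the clip length, embedding dimension, and library size are illustrative assumptions, not Twelve Labs figures.

```python
# Back-of-the-envelope scale of the video indexing problem. All parameters
# below are illustrative assumptions, not vendor figures.

FPS = 30
SECONDS_PER_HOUR = 3_600
frames_per_hour = FPS * SECONDS_PER_HOUR      # 108,000 frames, as in the text

LIBRARY_HOURS = 100_000                       # a large media archive
CLIP_SECONDS = 5                              # one embedding per 5-second clip
EMBED_DIM = 512                               # float32 vector per clip
BYTES_PER_FLOAT32 = 4

clips_to_index = LIBRARY_HOURS * SECONDS_PER_HOUR // CLIP_SECONDS
index_size_gb = clips_to_index * EMBED_DIM * BYTES_PER_FLOAT32 / 1e9

print(f"frames in one hour of video: {frames_per_hour:,}")
print(f"clips to index for {LIBRARY_HOURS:,} hours: {clips_to_index:,}")
print(f"embedding index size: {index_size_gb:.0f} GB")
```

Even with these modest assumptions, sub-second search means nearest-neighbor queries over tens of millions of vectors, which is why treating video as a pile of independent images does not scale.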
