
Twelve Labs

Also known as: Video search, Pegasus, Marengo
Video understanding company that lets you search, analyze, and generate content from video using natural language. Think of it as "RAG for video" — their models understand what happens in a video the way LLMs understand text.

Why it matters

Twelve Labs is building the foundational infrastructure for making the world's video content machine-readable. In an era where video dominates digital communication but remains largely unsearchable by AI, their purpose-built embedding and generation models solve a problem that even the largest frontier labs have only superficially addressed. If video is the dominant medium of the internet, whoever cracks video understanding at production scale holds a strategic position comparable to what Google Search holds for text.

Deep Dive

Twelve Labs was founded in 2021 by Jae Lee and Aiden Lee, who saw a massive gap in the AI landscape: while text-based models were advancing at breakneck speed, video remained stubbornly opaque to machines. You could ask an LLM to summarize a document in seconds, but asking it what happened at minute 14:32 of a two-hour video? Impossible. The founding team, with roots in computer vision research and experience at companies like Google and Samsung, recognized that video understanding required a fundamentally different approach from bolting image recognition onto a timeline. They set out to build multimodal foundation models that understand video natively — treating visual scenes, audio, speech, and on-screen text as a unified stream rather than separate channels stitched together after the fact.

Pegasus and Marengo: The Product Stack

Twelve Labs' core products are Pegasus and Marengo, each tackling a different piece of the video intelligence problem. Marengo is their video embedding model — it converts video content into rich vector representations that enable semantic search across massive video libraries. You can query "person in a red jacket opening a door" across thousands of hours of footage and get precise timestamp-level results, even if no one ever tagged or captioned that moment. Pegasus is their video-to-text generation model, capable of summarizing, describing, and answering questions about video content with a specificity that generic vision-language models struggle to match. Together, these models power an API that lets developers build applications like media asset management, compliance monitoring, content moderation, and educational video search without needing to build their own video ML pipeline from scratch.
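Mechanically, this kind of timestamp-level search reduces to nearest-neighbor lookup in embedding space: every video segment is embedded once at indexing time, the query text is embedded into the same space, and the best-matching segments' timestamp ranges come back ranked by similarity. A minimal sketch of that idea, using hand-made toy vectors in place of real Marengo embeddings (the dimensions, values, and index layout here are illustrative assumptions, not Twelve Labs' actual API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy index: each entry maps a (start_sec, end_sec) segment of one video
# to a precomputed embedding of that segment. In a real system these
# vectors would come from a video embedding model, not be hand-written.
segment_index = [
    ((0.0, 4.0),  [0.9, 0.1, 0.0]),    # e.g. "person walking outside"
    ((4.0, 8.0),  [0.1, 0.95, 0.05]),  # e.g. "red jacket, door opening"
    ((8.0, 12.0), [0.0, 0.2, 0.9]),    # e.g. "car driving at night"
]

def search(query_embedding, index, top_k=1):
    """Return the top_k segment timestamp ranges ranked by similarity."""
    scored = [(cosine(query_embedding, emb), span) for span, emb in index]
    scored.sort(reverse=True)
    return [span for _score, span in scored[:top_k]]

# A query like "person in a red jacket opening a door" would be embedded
# into the same space; here we fake that query embedding directly.
query = [0.05, 0.9, 0.1]
print(search(query, segment_index))  # → [(4.0, 8.0)]
```

The key design point is that no tags or captions are consulted at query time: the segment embeddings already encode what happens in the footage, so an untagged moment is as findable as a tagged one.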

Funding and Market Position

The company raised a $50 million Series A in 2024 led by NEA and NVentures (NVIDIA's venture arm), with participation from Index Ventures and existing investors. This brought their total funding past $70 million. The NVIDIA investment was particularly significant — it signaled that the GPU maker saw video understanding as a distinct, high-value market segment worth betting on, not just a feature that would eventually get absorbed into general-purpose multimodal models from OpenAI or Google. Twelve Labs has been deliberate about positioning themselves as infrastructure, not an end-user application. Their API-first approach means they don't compete with their customers; they're the plumbing that makes video-native AI applications possible across industries from media and entertainment to security and healthcare.

The Video Understanding Gap

The reason Twelve Labs has space to exist in a market dominated by well-funded generalist labs is that video is genuinely hard. A single hour of video at 30 frames per second contains 108,000 frames, plus audio, speech, text overlays, and temporal relationships among all of them. General-purpose multimodal models like GPT-4o and Gemini can process short video clips, but they struggle with the scale, precision, and speed that production video applications demand. Twelve Labs' purpose-built architecture is designed for exactly this problem: fast indexing of massive video libraries, sub-second search across hundreds of thousands of hours, and generation tasks that require understanding what happened over time, not just in a single frame. As video continues to dominate internet traffic and enterprise data — Cisco has forecast that video will account for 82% of all IP traffic — the companies that can make that content searchable and actionable will own a uniquely valuable piece of the AI stack.
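The frame arithmetic is easy to verify, and extending it to an archive shows why indexing throughput dominates the engineering problem. The 100,000-hour library size below is a hypothetical figure for scale intuition, not a number from this article:

```python
FPS = 30
SECONDS_PER_HOUR = 60 * 60

# One hour of 30fps video, as stated above:
frames_per_hour = FPS * SECONDS_PER_HOUR
print(frames_per_hour)  # → 108000

# Hypothetical archive size, purely illustrative (not from the article):
library_hours = 100_000
total_frames = frames_per_hour * library_hours
print(f"{total_frames:,} frames to index")  # → 10,800,000,000 frames to index
```

At that volume, treating video as a sequence of independent images is untenable, which is the argument for segment-level embeddings rather than per-frame analysis.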
