Companies

Twelve Labs

Also known as: Video search, Pegasus, Marengo
A video understanding company that lets you search, analyze, and generate content from video using natural language. Think of it as "RAG for video": their models understand what happens in a video much the way LLMs understand text.

Why It Matters

Twelve Labs is building foundational infrastructure to make the world's video content machine-readable. In an era when video dominates digital communication yet remains largely unsearchable by AI, their purpose-built embedding and generation models tackle a problem that even the largest frontier labs have addressed only superficially. If video is the internet's dominant medium, whoever cracks video understanding at production scale holds a strategic position comparable to the one Google Search holds for text.

Deep Dive

Twelve Labs was founded in 2021 by Jae Lee and Aiden Lee, who saw a massive gap in the AI landscape: while text-based models were advancing at breakneck speed, video remained stubbornly opaque to machines. You could ask an LLM to summarize a document in seconds, but asking it what happened at minute 14:32 of a two-hour video? Impossible. The founding team, with roots in computer vision research and experience at companies like Google and Samsung, recognized that video understanding required a fundamentally different approach from bolting image recognition onto a timeline. They set out to build multimodal foundation models that understand video natively — treating visual scenes, audio, speech, and on-screen text as a unified stream rather than separate channels stitched together after the fact.

Pegasus and Marengo: The Product Stack

Twelve Labs' core products are Pegasus and Marengo, each tackling a different piece of the video intelligence problem. Marengo is their video embedding model — it converts video content into rich vector representations that enable semantic search across massive video libraries. You can query "person in a red jacket opening a door" across thousands of hours of footage and get precise timestamp-level results, even if no one ever tagged or captioned that moment. Pegasus is their video-to-text generation model, capable of summarizing, describing, and answering questions about video content with a specificity that generic vision-language models struggle to match. Together, these models power an API that lets developers build applications like media asset management, compliance monitoring, content moderation, and educational video search without needing to build their own video ML pipeline from scratch.
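The search workflow described above can be sketched as nearest-neighbor retrieval over timestamped segment embeddings. This is a minimal, self-contained illustration of the general idea, not Twelve Labs' actual SDK or index format: the segment boundaries, vectors, and function names here are all hypothetical stand-ins for what a model like Marengo would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical index: (start_sec, end_sec, embedding) per video segment.
# In a real system the embeddings would come from a video embedding
# model; here they are tiny hand-made vectors for illustration.
segment_index = [
    (0.0, 4.0, [0.90, 0.10, 0.00]),   # e.g. "crowd in a stadium"
    (4.0, 8.0, [0.10, 0.95, 0.05]),   # e.g. "person opening a door"
    (8.0, 12.0, [0.00, 0.20, 0.90]),  # e.g. "car driving at night"
]

def search(query_embedding, index, top_k=1):
    """Rank segments by similarity to the query; return (score, start, end)."""
    scored = [(cosine(query_embedding, emb), start, end)
              for start, end, emb in index]
    scored.sort(reverse=True)
    return scored[:top_k]

# A text query would be embedded into the same vector space by the model;
# this vector is a stand-in for "person in a red jacket opening a door".
hits = search([0.15, 0.90, 0.10], segment_index)
print(hits[0][1], hits[0][2])  # start/end seconds of the best match
```

Because every segment carries its timestamps, the top result is a precise moment in the video rather than just a matching file, which is what enables the timestamp-level answers described above.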

Funding and Market Position

The company raised a $50 million Series A in 2024 led by NEA and NVentures (NVIDIA's venture arm), with participation from Index Ventures and existing investors. This brought their total funding past $70 million. The NVIDIA investment was particularly significant — it signaled that the GPU maker saw video understanding as a distinct, high-value market segment worth betting on, not just a feature that would eventually get absorbed into general-purpose multimodal models from OpenAI or Google. Twelve Labs has been deliberate about positioning themselves as infrastructure, not an end-user application. Their API-first approach means they don't compete with their customers; they're the plumbing that makes video-native AI applications possible across industries from media and entertainment to security and healthcare.

The Video Understanding Gap

The reason Twelve Labs has space to exist in a market dominated by well-funded generalist labs is that video is genuinely hard. A single hour of video at 30 frames per second contains 108,000 images, plus audio, speech, text overlays, and temporal relationships between all of them. General-purpose multimodal models like GPT-4o and Gemini can process short video clips, but they struggle with the scale, precision, and speed that production video applications demand. Twelve Labs' purpose-built architecture is designed for exactly this problem: fast indexing of massive video libraries, sub-second search across hundreds of thousands of hours, and generation tasks that require understanding what happened over time, not just in a single frame. As video continues to dominate internet traffic and enterprise data — Cisco estimates video will represent 82% of all IP traffic — the companies that can make that content searchable and actionable will own a uniquely valuable piece of the AI stack.
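The scale argument above can be made concrete with a bit of arithmetic. The sketch below checks the frames-per-hour figure and compares per-frame indexing against segment-level indexing; the 4-second segment length and 512-dimensional embedding size are illustrative assumptions, not published Twelve Labs parameters.

```python
FPS = 30
SECONDS_PER_HOUR = 3600
EMBED_DIM = 512       # assumed embedding width, for illustration
SEGMENT_SECONDS = 4   # assumed clip length per embedding

# One hour of 30 fps video, as stated in the text:
frames_per_hour = FPS * SECONDS_PER_HOUR
print(frames_per_hour)  # 108000 frames

# Indexing every frame with its own embedding:
floats_per_frame_index = frames_per_hour * EMBED_DIM

# Indexing one embedding per multi-second segment instead:
segments_per_hour = SECONDS_PER_HOUR // SEGMENT_SECONDS
floats_per_segment_index = segments_per_hour * EMBED_DIM

# Segment-level indexing shrinks the index by the frames-per-segment factor:
print(floats_per_frame_index // floats_per_segment_index)  # 120x smaller
```

Under these assumptions, segment-level indexing cuts the index size by two orders of magnitude per hour of footage, which is one reason purpose-built video models index whole temporal segments rather than treating video as a bag of frames.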
