Companies

Twelve Labs

Also known as: Video search, Pegasus, Marengo
A video understanding company that lets you search, analyze, and generate content from video using natural language. Think of it as "RAG for video": its models understand what happens in a video the way LLMs understand text.

Why it matters

Twelve Labs is building the foundational infrastructure to make the world's video content machine-readable. In an era where video dominates digital communication yet remains largely unsearchable by AI, its purpose-built embedding and generation models solve a problem that even the largest frontier labs have addressed only superficially. If video is the internet's dominant medium, whoever solves video understanding at production scale holds a strategic position comparable to the one Google Search holds for text.

Deep Dive

Twelve Labs was founded in 2021 by Jae Lee and Aiden Lee, who saw a massive gap in the AI landscape: while text-based models were advancing at breakneck speed, video remained stubbornly opaque to machines. You could ask an LLM to summarize a document in seconds, but asking it what happened at the 14:32 mark of a two-hour video? Impossible. The founding team, with roots in computer vision research and experience at companies like Google and Samsung, recognized that video understanding required a fundamentally different approach from bolting image recognition onto a timeline. They set out to build multimodal foundation models that understand video natively — treating visual scenes, audio, speech, and on-screen text as a unified stream rather than separate channels stitched together after the fact.

Pegasus and Marengo: The Product Stack

Twelve Labs' core products are Pegasus and Marengo, each tackling a different piece of the video intelligence problem. Marengo is their video embedding model — it converts video content into rich vector representations that enable semantic search across massive video libraries. You can query "person in a red jacket opening a door" across thousands of hours of footage and get precise timestamp-level results, even if no one ever tagged or captioned that moment. Pegasus is their video-to-text generation model, capable of summarizing, describing, and answering questions about video content with a specificity that generic vision-language models struggle to match. Together, these models power an API that lets developers build applications like media asset management, compliance monitoring, content moderation, and educational video search without needing to build their own video ML pipeline from scratch.
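The search workflow Marengo enables can be illustrated conceptually: embed the text query, compare it against per-segment video embeddings, and return the highest-scoring timestamps. The sketch below uses hand-made 4-dimensional vectors as stand-ins for real model embeddings; it is a conceptual illustration, not the Twelve Labs API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search_segments(query_vec, segments, top_k=1):
    """Rank video segments by similarity to a text-query embedding.

    `segments` maps (start_sec, end_sec) -> embedding vector. In a real
    system both sides would come from a joint video-text model; here
    the vectors are illustrative stand-ins.
    """
    scored = [
        (cosine_similarity(query_vec, vec), span)
        for span, vec in segments.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical embeddings for three indexed 10-second segments.
segments = {
    (0, 10): [0.9, 0.1, 0.0, 0.1],   # e.g. "person opens a door"
    (10, 20): [0.1, 0.8, 0.2, 0.0],  # e.g. "car drives past"
    (20, 30): [0.0, 0.2, 0.9, 0.1],  # e.g. "crowd at a concert"
}
query = [0.85, 0.15, 0.05, 0.1]      # embedding of the text query

best = search_segments(query, segments)
print(best)  # best match: the (0, 10) segment
```

The key property is that the match is semantic rather than keyword-based: nothing in the index needs to have been manually tagged for the query to land on the right timestamp range.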

Funding and Market Position

The company raised a $50 million Series A in 2024 led by NEA and NVentures (NVIDIA's venture arm), with participation from Index Ventures and existing investors. This brought their total funding past $70 million. The NVIDIA investment was particularly significant — it signaled that the GPU maker saw video understanding as a distinct, high-value market segment worth betting on, not just a feature that would eventually get absorbed into general-purpose multimodal models from OpenAI or Google. Twelve Labs has been deliberate about positioning themselves as infrastructure, not an end-user application. Their API-first approach means they don't compete with their customers; they're the plumbing that makes video-native AI applications possible across industries from media and entertainment to security and healthcare.

The Video Understanding Gap

The reason Twelve Labs has space to exist in a market dominated by well-funded generalist labs is that video is genuinely hard. A single hour of video at 30 frames per second contains 108,000 images, plus audio, speech, text overlays, and temporal relationships between all of them. General-purpose multimodal models like GPT-4o and Gemini can process short video clips, but they struggle with the scale, precision, and speed that production video applications demand. Twelve Labs' purpose-built architecture is designed for exactly this problem: fast indexing of massive video libraries, sub-second search across hundreds of thousands of hours, and generation tasks that require understanding what happened over time, not just in a single frame. As video continues to dominate internet traffic and enterprise data — Cisco's Visual Networking Index forecast that video would represent 82% of all IP traffic by 2022 — the companies that can make that content searchable and actionable will own a uniquely valuable piece of the AI stack.
