Using AI

Information Extraction

IE, Structured Extraction
Automatically extracting structured information from unstructured text. Given a news article, extract who did what, when, where, and why. Given a contract, extract the parties, dates, obligations, and amounts. IE combines NER (finding entities), relation extraction (finding connections between entities), and event extraction (finding what happened) into a unified pipeline.

Why It Matters

Most of the world's information is locked in unstructured text: emails, reports, articles, legal documents, medical records. Information extraction turns this text into structured data that can be searched, analyzed, and acted on. It is the technology that lets you ask a database-style question of a pile of documents.

Deep Dive

The IE pipeline traditionally has three stages: entity extraction (find all mentions of people, organizations, dates, amounts), relation extraction (determine relationships: "Company X acquired Company Y for $Z"), and coreference resolution (recognize that "the company," "Apple," and "it" all refer to the same entity). Each stage builds on the previous one to produce structured, linked information.
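The three stages can be sketched in a few lines of Python. This is a toy illustration only: the regular expressions stand in for trained models, the company names and the sample sentence are made up, and the "most recent organization" rule is a deliberately naive assumption about how coreference might be resolved.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    text: str
    label: str  # "ORG", "MONEY", or "REF" (a pronoun or nominal reference)
    start: int

# Toy patterns standing in for a trained entity recognizer.
PATTERNS = {
    "ORG": re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd)\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)"),
    "REF": re.compile(r"\b(?:[Tt]he company|[Ii]t)\b"),
}

def extract_mentions(text):
    """Stage 1: find every mention and sort by position in the text."""
    mentions = [
        Mention(m.group(), label, m.start())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(mentions, key=lambda m: m.start)

def extract_relations(text, mentions):
    """Stage 2: link the mentions around each 'acquired' into a relation."""
    relations = []
    for verb in re.finditer(r"\bacquired\b", text):
        pos = verb.start()
        subjects = [m for m in mentions if m.label in ("ORG", "REF") and m.start < pos]
        targets = [m for m in mentions if m.label == "ORG" and m.start > pos]
        prices = [m for m in mentions if m.label == "MONEY" and m.start > pos]
        if subjects and targets:
            relations.append({
                "buyer": subjects[-1],   # nearest mention before the verb
                "target": targets[0],    # nearest organization after it
                "price": prices[0] if prices else None,
            })
    return relations

def resolve_coreferences(mentions):
    """Stage 3: map each reference to the most recent preceding ORG."""
    links, last_org = {}, None
    for m in mentions:
        if m.label == "ORG":
            last_org = m
        elif m.label == "REF" and last_org is not None:
            links[m] = last_org
    return links

def run_pipeline(text):
    mentions = extract_mentions(text)
    relations = extract_relations(text, mentions)
    links = resolve_coreferences(mentions)
    records = []
    for r in relations:
        buyer = links.get(r["buyer"], r["buyer"])  # replace "The company" etc.
        records.append({
            "type": "acquired",
            "buyer": buyer.text,
            "target": r["target"].text,
            "price": r["price"].text if r["price"] else None,
        })
    return records

TEXT = ("Acme Corp reported record revenue. "
        "The company acquired Widget Inc for $50 million.")
records = run_pipeline(TEXT)
print(records)
```

Without the coreference step the buyer would be recorded as "The company"; with it, the record names Acme Corp, which is what makes the output linkable across sentences.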

LLMs Changed Everything

LLMs collapsed the IE pipeline into a single prompt: "Extract all companies, people, amounts, and dates from this text. For each, identify their relationships. Return as JSON." This works well for common extraction tasks and removes the need for a separate model per subtask. The trade-off: LLM extraction is typically slower and more expensive per document than dedicated models, and its output format is less predictable, although structured output modes help.
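Because the output format is less predictable, it helps to validate the reply before trusting it. The sketch below builds the prompt quoted above and checks the response. `fake_llm` is a hypothetical stand-in that returns a canned reply rather than calling a real model, and the `entities` and `relations` keys are an assumed schema, not one prescribed by any provider.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap JSON in

PROMPT_TEMPLATE = (
    "Extract all companies, people, amounts, and dates from this text. "
    "For each, identify their relationships. Return as JSON with the keys "
    '"entities" and "relations".\n\nText:\n{text}'
)
REQUIRED_KEYS = {"entities", "relations"}  # assumed schema for this sketch

def build_prompt(text):
    return PROMPT_TEMPLATE.format(text=text)

def parse_extraction(raw):
    """Parse a model reply; raise ValueError if it is not the expected JSON."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag.
        cleaned = cleaned.strip("`").strip()
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        raise ValueError(f"reply is missing keys: {sorted(REQUIRED_KEYS)}")
    return data

def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    body = json.dumps({
        "entities": [
            {"text": "Acme Corp", "type": "company"},
            {"text": "Widget Inc", "type": "company"},
            {"text": "$50 million", "type": "amount"},
        ],
        "relations": [
            {"type": "acquired", "subject": "Acme Corp",
             "object": "Widget Inc", "amount": "$50 million"},
        ],
    })
    return FENCE + "json\n" + body + "\n" + FENCE

prompt = build_prompt("Acme Corp acquired Widget Inc for $50 million.")
result = parse_extraction(fake_llm(prompt))
print(result["relations"][0]["subject"], "->", result["relations"][0]["object"])
```

In practice a failed parse would usually trigger a retry or a fallback; a structured output mode, where available, reduces how often that path is taken.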

Document Understanding

Modern IE goes beyond text: document understanding models (LayoutLM, Donut) extract information from visually rich documents (invoices, receipts, forms) by understanding both text content and spatial layout. "Total: $42.50" in the bottom-right of an invoice means something different from the same text in a body paragraph. Depending on the model, this combines OCR (or, as in Donut, an OCR-free image encoder), layout analysis, and NLP to extract structured data from real-world documents.
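To see why position matters, consider a hand-written heuristic over OCR tokens with coordinates. This is not how LayoutLM or Donut work internally, since those models learn layout cues from data; it is only a sketch of the signal they exploit. The `Token` structure, the normalized coordinates, and the "closest to the bottom-right" scoring rule are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    x: float  # left edge, normalized 0..1 across the page width
    y: float  # top edge, normalized 0..1 down the page

MONEY = re.compile(r"^\$\d+(?:\.\d{2})?$")

def find_invoice_total(tokens, row_tolerance=0.01):
    """Return the amount next to the 'Total' label nearest the bottom-right."""
    candidates = []
    for label in tokens:
        if label.text.rstrip(":").lower() != "total":
            continue
        # Amounts on the same row, to the right of the label.
        same_row = [
            t for t in tokens
            if abs(t.y - label.y) <= row_tolerance
            and t.x > label.x
            and MONEY.match(t.text)
        ]
        if same_row:
            value = min(same_row, key=lambda t: t.x)  # closest to the label
            # Larger x + y means further toward the bottom-right corner.
            candidates.append((label.x + label.y, value.text))
    if not candidates:
        return None
    return max(candidates)[1]

tokens = [
    # A sentence in the body of the page that happens to mention a total.
    Token("Deposit", 0.10, 0.40), Token("total:", 0.20, 0.40), Token("$10.00", 0.28, 0.40),
    # The summary line in the bottom-right corner of the invoice.
    Token("Total:", 0.70, 0.90), Token("$42.50", 0.82, 0.90),
]
print(find_invoice_total(tokens))
```

Both rows contain a "total" label followed by an amount, so text alone cannot tell them apart; only the coordinates single out the summary line.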
