Information Extraction: Definition & Meaning — AI Wiki

Information extraction (IE) automatically pulls structured information out of unstructured text. Given a news article, extract who did what, when, where, and why. Given a contract, extract the parties, dates, obligations, and amounts. IE combines named entity recognition (NER, finding the entities), relation extraction (finding the connections between entities), and event extraction (finding what happened) into a unified pipeline.

Why It Matters

Most of the world's information is locked in unstructured text: emails, reports, articles, legal documents, medical records. Information extraction turns that text into structured data that can be searched, analyzed, and processed. It is the technology that lets you ask a database-style question over a pile of documents.

Deep Dive

The IE pipeline traditionally has three stages: entity extraction (find all mentions of people, organizations, dates, amounts), relation extraction (determine relationships: "Company X acquired Company Y for $Z"), and coreference resolution (recognize that "the company," "Apple," and "it" all refer to the same entity). Each stage builds on the previous one to produce structured, linked information.
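The three stages can be sketched with toy rules to show how each one consumes the output of the one before. This is a minimal illustration only: the organization list, the regular expressions, and the "most recent organization" coreference heuristic are all assumptions made for the example, and a real system would use trained models at every stage.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset in the source text


# Hypothetical gazetteer and money pattern, standing in for a trained NER model.
ORG_PATTERN = re.compile(r"\b(?:Apple|Acme Corp|Beta Labs)\b")
MONEY_PATTERN = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")
# Generic mentions that may refer back to an organization.
MENTION_PATTERN = re.compile(r"\b(?:the company|it)\b", re.IGNORECASE)


def extract_entities(text):
    """Stage 1: find organization and money mentions."""
    found = [Entity(m.group(), "ORG", m.start()) for m in ORG_PATTERN.finditer(text)]
    found += [Entity(m.group(), "MONEY", m.start()) for m in MONEY_PATTERN.finditer(text)]
    return sorted(found, key=lambda e: e.start)


def extract_relations(text, entities):
    """Stage 2: link two adjacent ORG entities joined by the word 'acquired'."""
    orgs = [e for e in entities if e.label == "ORG"]
    money = [e for e in entities if e.label == "MONEY"]
    relations = []
    for buyer, target in zip(orgs, orgs[1:]):
        between = text[buyer.start + len(buyer.text):target.start]
        if between.strip() == "acquired":
            # Take the first amount mentioned after the target, if any.
            price = next((m.text for m in money if m.start > target.start), None)
            relations.append({"type": "acquisition", "buyer": buyer.text,
                              "target": target.text, "price": price})
    return relations


def resolve_coreferences(text, entities):
    """Stage 3: map each generic mention to the most recent preceding ORG."""
    orgs = [e for e in entities if e.label == "ORG"]
    links = []
    for m in MENTION_PATTERN.finditer(text):
        previous = [o for o in orgs if o.start < m.start()]
        if previous:
            links.append((m.group(), previous[-1].text))
    return links


text = ("Acme Corp acquired Beta Labs for $12 million. "
        "Beta Labs builds sensors, and the company says it will keep its brand.")
entities = extract_entities(text)
relations = extract_relations(text, entities)
links = resolve_coreferences(text, entities)
```

Here `relations` holds one acquisition record and `links` ties both "the company" and "it" to "Beta Labs". The nearest-antecedent rule is easy to fool, which is one reason coreference is usually handled by a learned model.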

LLMs Changed Everything

LLMs collapsed the IE pipeline into a single prompt: "Extract all companies, people, amounts, and dates from this text. For each, identify their relationships. Return as JSON." This works remarkably well for common extraction tasks and eliminates the need for separate models for each subtask. The trade-off: LLM extraction is slower and more expensive than dedicated models, and less predictable in output format (structured output modes help).
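A rough sketch of the single-prompt approach follows, including a validation step for the unpredictable output format mentioned above. The `call_llm` parameter is a hypothetical callable (prompt string in, reply string out) rather than any real client library, the `"entities"` and `"relations"` keys are an assumed schema, and `fake_llm` is a stub used only so the example runs offline.

```python
import json

PROMPT_TEMPLATE = (
    "Extract all companies, people, amounts, and dates from this text. "
    "For each, identify their relationships. Return as JSON with the keys "
    '"entities" and "relations".\n\nText:\n{text}'
)

# Assumed schema for this example; adjust to whatever the prompt asks for.
REQUIRED_KEYS = {"entities", "relations"}


def build_prompt(text):
    return PROMPT_TEMPLATE.format(text=text)


def parse_extraction(raw):
    """Parse a model reply and validate it; raise ValueError if it is malformed."""
    # Models often add prose around the JSON, so keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in model output: {exc}") from exc
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def extract(text, call_llm):
    """Run one extraction: build the prompt, call the model, validate the reply."""
    return parse_extraction(call_llm(build_prompt(text)))


def fake_llm(prompt):
    # Stub reply standing in for a real model call.
    return ('Here is the result:\n'
            '{"entities": [{"text": "Acme Corp", "type": "company"}], "relations": []}')


result = extract("Acme Corp hired a new CFO on 3 May.", fake_llm)
```

Raising on malformed replies lets the caller decide whether to retry, fall back to a dedicated model, or log the failure. Structured output modes, where available, reduce how often this path is hit.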

Document Understanding

Modern IE goes beyond text: document understanding models (LayoutLM, Donut) extract information from visually rich documents (invoices, receipts, forms) by understanding both text content and spatial layout. "Total: $42.50" in the bottom-right of an invoice means something different from the same text in a body paragraph. LayoutLM-style models take OCR output (words plus their bounding boxes) as input, while Donut is OCR-free and reads the page image directly. Either way, the result is structured data extracted from real-world documents.
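To make the point about position concrete, here is a small hand-written rule that only accepts a "Total:" label when it sits in the bottom-right region of the page. This is not how LayoutLM or Donut work internally (they learn such cues from data); it is an illustrative sketch. The `Word` type, the region thresholds, and the sample coordinates are assumptions, with boxes on a 0 to 1000 scale.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with its bounding box, normalized to a 0-1000 page scale."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


AMOUNT = re.compile(r"^\$\d+(?:\.\d{2})?$")


def find_invoice_total(words, min_x=500, min_y=700):
    """Return the amount following a 'Total' label located in the bottom-right region.

    The thresholds are arbitrary example values, not tuned on real invoices.
    """
    for label, value in zip(words, words[1:]):
        in_region = label.x0 >= min_x and label.y0 >= min_y
        if label.text.lower().startswith("total") and in_region and AMOUNT.match(value.text):
            return value.text
    return None


# Two "Total:" labels on the same page: one in a body paragraph, one in the footer.
page = [
    Word("Total:", 100, 200, 160, 220),
    Word("$42.50", 170, 200, 230, 220),
    Word("Total:", 700, 900, 760, 920),
    Word("$118.00", 770, 900, 840, 920),
]
invoice_total = find_invoice_total(page)
```

With these sample coordinates the function skips the body-paragraph mention and returns the footer amount. A text-only extractor would see two identical labels and have no basis for choosing between them.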
