Information Extraction: Definition & Meaning — AI Wiki

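Automatically extracting structured information from unstructured text. Given a news article, extract who did what, when, where, and why. Given a contract, extract the parties, dates, obligations, and amounts. IE combines NER (finding entities), relation extraction (finding the connections between entities), and event extraction (finding what happened) into a single pipeline.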

Why It Matters

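Most of the world's information is locked in unstructured text: emails, reports, articles, legal documents, medical records. Information extraction turns that text into structured data that can be searched, analyzed, and acted on. It is the technology that lets you ask database-style questions of a pile of documents.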

Deep Dive

The IE pipeline traditionally has three stages: entity extraction (find all mentions of people, organizations, dates, and amounts), relation extraction (determine how entities relate, e.g. "Company X acquired Company Y for $Z"), and coreference resolution (recognize that "the company," "Apple," and "it" all refer to the same entity). The stages depend on one another: relations are defined over the extracted entities, and coreference links make sure that facts stated about different mentions attach to the same entity. Together they produce structured, linked information.
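
To make the three stages concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not a production approach: the organization names (Acme, Globex), the regular expressions, and the "first earlier organization" coreference heuristic are all hypothetical stand-ins for what would normally be trained models.

```python
import re

# Hypothetical toy patterns; a real system would use trained models instead.
ORG = re.compile(r"\b(?:Acme|Globex)\b")
MONEY = re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)")
ANAPHOR = re.compile(r"\b(?:the company|it)\b", re.IGNORECASE)
ACQUIRED = re.compile(
    r"(?P<buyer>\w+) acquired (?P<target>\w+) "
    r"for (?P<price>\$\d+(?:\.\d+)? (?:million|billion))"
)


def extract_entities(text):
    """Stage 1: find entity mentions and record where they occur."""
    entities = []
    for label, pattern in (("ORG", ORG), ("MONEY", MONEY)):
        for m in pattern.finditer(text):
            entities.append({"label": label, "text": m.group(), "start": m.start()})
    return sorted(entities, key=lambda e: e["start"])


def extract_relations(text):
    """Stage 2: find 'X acquired Y for $Z' relations."""
    return [{"type": "acquired", **m.groupdict()} for m in ACQUIRED.finditer(text)]


def resolve_coreferences(text, entities):
    """Stage 3: link 'the company' / 'it' to the first ORG mentioned earlier.

    This is a deliberately naive heuristic, chosen only for illustration.
    """
    orgs = [e for e in entities if e["label"] == "ORG"]
    links = []
    for m in ANAPHOR.finditer(text):
        earlier = [e for e in orgs if e["start"] < m.start()]
        if earlier:
            links.append({"mention": m.group(), "refers_to": earlier[0]["text"]})
    return links


def run_pipeline(text):
    """Run the three stages in order and collect their outputs."""
    entities = extract_entities(text)
    return {
        "entities": entities,
        "relations": extract_relations(text),
        "coreferences": resolve_coreferences(text, entities),
    }


if __name__ == "__main__":
    sample = "Acme acquired Globex for $2 billion. The company will keep the Globex brand."
    print(run_pipeline(sample))
```

Each stage returns plain dictionaries so that the outputs can be merged into one record per document. The coreference heuristic is the weakest part: real resolvers weigh syntax, distance, and semantics rather than picking the first organization.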

LLMs Changed Everything

LLMs collapsed the IE pipeline into a single prompt: "Extract all companies, people, amounts, and dates from this text. For each, identify their relationships. Return as JSON." This works remarkably well for common extraction tasks and eliminates the need for separate models for each subtask. The trade-off: LLM extraction is slower and more expensive than dedicated models, and less predictable in output format (structured output modes help).
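
The single-prompt approach, together with a guard against unpredictable output, might look roughly like the sketch below. It assumes a hypothetical `call_llm` callable that takes a prompt string and returns the model's raw text; no specific provider API is implied. The `entities` and `relations` keys are likewise an assumed output schema, not one the prompt above prescribes.

```python
import json

# The prompt wording follows the example in the text; the key names are assumptions.
PROMPT_TEMPLATE = (
    "Extract all companies, people, amounts, and dates from this text. "
    "For each, identify their relationships. Return as JSON with the keys "
    '"entities" and "relations".\n\nText:\n{text}'
)
REQUIRED_KEYS = {"entities", "relations"}


def build_prompt(text):
    """Fill the document text into the extraction prompt."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_response(raw):
    """Parse the model output and check it has the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        raise ValueError("response is missing required keys")
    return data


def extract(text, call_llm, max_attempts=2):
    """Ask the model for an extraction, retrying when the output is malformed.

    `call_llm` is a hypothetical callable: prompt string in, raw text out.
    """
    prompt = build_prompt(text)
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_response(call_llm(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"extraction failed after {max_attempts} attempts") from last_error
```

Validating and retrying is one way to cope with format drift when a structured output mode is not available. With such a mode, the parsing step usually stays but the retries become rare.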

Document Understanding

Modern IE goes beyond text: document understanding models (LayoutLM, Donut) extract information from visually rich documents (invoices, receipts, forms) by understanding both text content and spatial layout. "Total: $42.50" in the bottom-right of an invoice means something different from the same text in a body paragraph. LayoutLM-style models combine OCR output, layout analysis, and NLP, while Donut is OCR-free and reads the document image directly. Both aim to extract structured data from real-world documents.
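
The point about spatial layout can be illustrated with a small rule-based sketch. This is not how LayoutLM or Donut work internally (they learn layout cues from data); it only shows why position changes meaning. The `TextLine` type, the normalized coordinates, and the bottom-right thresholds are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class TextLine:
    """One recognized line of text with a normalized position on the page."""
    text: str
    x: float  # 0.0 = left edge, 1.0 = right edge
    y: float  # 0.0 = top edge, 1.0 = bottom edge


# Matches "Total: $42.50" but not "Subtotal: $42.50".
TOTAL_LINE = re.compile(r"\btotal:?\s*(\$\d+(?:\.\d{2})?)", re.IGNORECASE)


def find_invoice_total(lines, min_x=0.5, min_y=0.7):
    """Return the amount from a 'Total' line in the bottom-right region, if any.

    The thresholds are hypothetical; real invoices vary widely in layout.
    """
    for line in lines:
        match = TOTAL_LINE.search(line.text)
        if match and line.x >= min_x and line.y >= min_y:
            return match.group(1)
    return None


if __name__ == "__main__":
    page = [
        TextLine("Total: $42.50 was quoted in our last letter.", x=0.3, y=0.4),
        TextLine("Subtotal: $50.00", x=0.85, y=0.88),
        TextLine("Total: $57.10", x=0.85, y=0.92),
    ]
    print(find_invoice_total(page))
```

The same text pattern appears twice on the sample page, and only its position separates the invoice total from a passing mention in the body.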
