Information Extraction: Definition & Meaning — AI Wiki

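Information extraction (IE) automatically pulls structured information out of unstructured text. Given a news article, it extracts who did what, when, where, and why. Given a contract, it extracts the parties, dates, obligations, and amounts. IE combines named entity recognition (NER, finding entities), relation extraction (finding the connections between entities), and event extraction (finding what happened) into one unified pipeline.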

Why It Matters

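Most of the world's information is locked in unstructured text: emails, reports, articles, legal documents, medical records. Information extraction turns that text into structured data that can be searched, analyzed, and acted on. It is the technology that lets you ask database-style questions of a pile of documents.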

Deep Dive

The IE pipeline traditionally has three stages: entity extraction (find all mentions of people, organizations, dates, amounts), relation extraction (determine relationships: "Company X acquired Company Y for $Z"), and coreference resolution (recognize that "the company," "Apple," and "it" all refer to the same entity). Each stage builds on the previous one to produce structured, linked information.
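The three stages can be sketched with a toy example. This is only an illustration under loud assumptions: the regex patterns, the `Entity` class, and the "nearest preceding organization" coreference rule are all hypothetical stand-ins for the trained models a real pipeline would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset in the source text


# Hypothetical toy patterns; real pipelines use trained models for each stage.
ORG = r"Apple|Acme Corp"
MONEY = r"\$\d+(?:\.\d+)? (?:million|billion)"
MENTION = r"the company|it"
ENTITY_PATTERNS = {"ORG": re.compile(rf"\b(?:{ORG})\b"), "MONEY": re.compile(MONEY)}
ACQUISITION = re.compile(
    rf"\b(?P<buyer>{ORG}|{MENTION}) acquired (?P<target>{ORG}) for (?P<price>{MONEY})",
    re.IGNORECASE,
)


def extract_entities(text):
    """Stage 1: find and label entity mentions."""
    found = [
        Entity(m.group(), label, m.start())
        for label, pattern in ENTITY_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)


def extract_relations(text):
    """Stage 2: find acquisitions; the buyer may still be an unresolved mention."""
    return [
        {"buyer": m["buyer"], "buyer_start": m.start("buyer"),
         "target": m["target"], "price": m["price"]}
        for m in ACQUISITION.finditer(text)
    ]


def resolve_buyer(relation, entities):
    """Stage 3: replace a pronoun-like buyer with the nearest preceding ORG."""
    if re.fullmatch(MENTION, relation["buyer"], re.IGNORECASE) is None:
        return relation  # already a named entity
    preceding = [e for e in entities
                 if e.label == "ORG" and e.start < relation["buyer_start"]]
    if not preceding:
        return relation  # nothing to link to; leave unresolved
    return {**relation, "buyer": preceding[-1].text}


text = ("Apple reported record revenue. "
        "The company said it acquired Acme Corp for $50 million.")
entities = extract_entities(text)
relations = [resolve_buyer(r, entities) for r in extract_relations(text)]
print(relations)
```

In this made-up sentence the relation stage first sees "it" as the buyer, and the coreference step links it back to "Apple". The nearest-preceding rule is deliberately naive and would fail on many real sentences.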

LLMs Changed Everything

LLMs collapsed the IE pipeline into a single prompt: "Extract all companies, people, amounts, and dates from this text. For each, identify their relationships. Return as JSON." This works well for common extraction tasks and removes the need for a separate model per subtask. The trade-off: LLM extraction is slower and more expensive than dedicated models, and its output format is less predictable, although structured output modes help.
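Because the output format is the unpredictable part, a common pattern is to validate the reply before using it. The sketch below assumes a hypothetical `call_llm` function that returns a canned reply in place of a real model call, and a made-up JSON schema (`entities` and `relations` lists); neither comes from any particular provider's API.

```python
import json

INSTRUCTIONS = (
    "Extract all companies, people, amounts, and dates from this text. "
    "For each, identify their relationships. Return as JSON with an "
    '"entities" list (objects with "text" and "type") and a "relations" '
    'list (objects with "subject", "predicate", "object").'
)
ALLOWED_TYPES = {"company", "person", "amount", "date"}


def build_prompt(text):
    """Combine the fixed instructions with the document to analyze."""
    return INSTRUCTIONS + "\n\nText:\n" + text


def call_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return json.dumps({
        "entities": [
            {"text": "Company X", "type": "company"},
            {"text": "Company Y", "type": "company"},
            {"text": "$Z", "type": "amount"},
        ],
        "relations": [
            {"subject": "Company X", "predicate": "acquired", "object": "Company Y"},
        ],
    })


def parse_extraction(raw):
    """Check the reply against the assumed schema; raise ValueError if it drifts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    entities, relations = data.get("entities"), data.get("relations")
    if not isinstance(entities, list) or not isinstance(relations, list):
        raise ValueError("reply must contain 'entities' and 'relations' lists")
    for ent in entities:
        if not isinstance(ent, dict) or not isinstance(ent.get("text"), str):
            raise ValueError(f"malformed entity: {ent!r}")
        if ent.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"unexpected entity type: {ent!r}")
    known = {ent["text"] for ent in entities}
    for rel in relations:
        if not isinstance(rel, dict):
            raise ValueError(f"malformed relation: {rel!r}")
        if rel.get("subject") not in known or rel.get("object") not in known:
            raise ValueError(f"relation refers to an unknown entity: {rel!r}")
    return data


prompt = build_prompt("Company X acquired Company Y for $Z.")
result = parse_extraction(call_llm(prompt))
print(result["relations"])
```

Rejecting a malformed reply early makes it possible to retry the request or fall back to another extractor instead of passing bad data downstream.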

Document Understanding

Modern IE goes beyond text: document understanding models (LayoutLM, Donut) extract information from visually rich documents (invoices, receipts, forms) by understanding both text content and spatial layout. "Total: $42.50" in the bottom-right of an invoice means something different from the same text in a body paragraph. LayoutLM-style models take OCR text together with each token's position on the page, while Donut is OCR-free and reads the document image directly. Either way, the goal is structured data extracted from real-world documents.
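To see why position matters, consider a hand-written rule over tokens with coordinates. This is not how LayoutLM or Donut work internally, since those models learn such cues from data. The `Token` class, the normalized coordinates, the region thresholds, and the sample amounts are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # left edge, normalized to [0, 1] across the page width
    y: float  # top edge, normalized to [0, 1] down the page height


def find_invoice_total(tokens, min_x=0.5, min_y=0.7, line_tol=0.01):
    """Return the amount next to a 'Total:' label in the bottom-right region.

    The thresholds are hypothetical; a learned model would infer them.
    """
    labels = [t for t in tokens
              if t.text == "Total:" and t.x >= min_x and t.y >= min_y]
    for label in labels:
        # Candidates sit on the same line, to the right of the label.
        same_line = [t for t in tokens
                     if abs(t.y - label.y) <= line_tol and t.x > label.x]
        if same_line:
            return min(same_line, key=lambda t: t.x).text
    return None


# Made-up OCR-style output: the same label appears in two places.
tokens = [
    Token("Total:", 0.10, 0.30), Token("$42.50", 0.20, 0.30),  # body paragraph
    Token("last", 0.30, 0.30), Token("month", 0.36, 0.30),
    Token("Total:", 0.70, 0.90), Token("$58.00", 0.82, 0.90),  # summary box
]
print(find_invoice_total(tokens))
```

A text-only extractor would see two identical "Total:" labels and have no principled way to choose; the coordinates are what separate the invoice total from a passing mention.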
