OCR: Definition & Meaning — AI Wiki

從影像中提取文字 — 文件照片、螢幕截圖、招牌、手寫筆記,或任何包含文字的影像。現代 OCR 結合文字偵測(找到文字在影像中的位置)和文字辨識(讀出文字內容)。深度學習 OCR 處理彎曲文字、多語言、各種字體、差影像品質的能力,遠超老的基於規則的方法。

為什麼重要

OCR 把物理世界數位化。掃描收據做開支管理、讀文件做歸檔、從表單擷取資料、即時翻譯招牌、讓基於影像的 PDF 可搜尋,都依賴 OCR。結合 LLM,OCR 實現了高級的文件理解 — 不只是讀文字,而是理解發票、合約、報告。

Deep Dive

Modern OCR pipelines have two stages: detection (finding text regions using models like CRAFT or DBNet) and recognition (reading text in each region using CRNN or Transformer-based models). End-to-end approaches (like PaddleOCR, EasyOCR) combine both stages. For structured documents, specialized models (LayoutLM, Donut) understand both text content and spatial layout, recognizing that "Total: $42.50" on an invoice means something different from the same text in a paragraph.

Vision LLMs as OCR

Multimodal LLMs (Claude, GPT-4V, Gemini) have become remarkably good at OCR as a side effect of their vision capabilities. You can upload an image and ask "read all text in this image" or "extract the table from this receipt." For complex documents with mixed layouts, handwriting, and multiple languages, vision LLMs often outperform dedicated OCR systems because they understand context and can handle ambiguity. The trade-off is speed and cost — dedicated OCR is 100x faster for bulk processing.

Challenges

Remaining hard problems: handwriting recognition (especially cursive or messy handwriting), degraded historical documents, text in complex backgrounds (wild text on signs, clothing, products), and scripts with complex character compositions (Chinese, Arabic, Devanagari). Accuracy varies significantly by language and script — Latin script OCR is nearly solved, but CJK and right-to-left scripts still have meaningful error rates.

OCR

為什麼重要

Deep Dive

Vision LLMs as OCR

Challenges

相關概念