OCR: Definition & Meaning — AI Wiki

从图像中提取文字 — 文档照片、屏幕截图、标牌、手写笔记,或任何包含文字的图像。现代 OCR 结合文字检测(找到文字在图像中的位置)和文字识别(读出文字内容)。深度学习 OCR 处理弯曲文字、多语言、各种字体、差图像质量的能力,远超老的基于规则的方法。

为什么重要

OCR 把物理世界数字化。扫描收据做开支管理、读文档做归档、从表单提取数据、实时翻译标牌、让基于图像的 PDF 可搜索,都依赖 OCR。结合 LLM,OCR 实现了高级的文档理解 — 不只是读文字,而是理解发票、合同、报告。

Deep Dive

Modern OCR pipelines have two stages: detection (finding text regions using models like CRAFT or DBNet) and recognition (reading text in each region using CRNN or Transformer-based models). End-to-end approaches (like PaddleOCR, EasyOCR) combine both stages. For structured documents, specialized models (LayoutLM, Donut) understand both text content and spatial layout, recognizing that "Total: $42.50" on an invoice means something different from the same text in a paragraph.

Vision LLMs as OCR

Multimodal LLMs (Claude, GPT-4V, Gemini) have become remarkably good at OCR as a side effect of their vision capabilities. You can upload an image and ask "read all text in this image" or "extract the table from this receipt." For complex documents with mixed layouts, handwriting, and multiple languages, vision LLMs often outperform dedicated OCR systems because they understand context and can handle ambiguity. The trade-off is speed and cost — dedicated OCR is 100x faster for bulk processing.

Challenges

Remaining hard problems: handwriting recognition (especially cursive or messy handwriting), degraded historical documents, text in complex backgrounds (wild text on signs, clothing, products), and scripts with complex character compositions (Chinese, Arabic, Devanagari). Accuracy varies significantly by language and script — Latin script OCR is nearly solved, but CJK and right-to-left scripts still have meaningful error rates.

OCR

为什么重要

Deep Dive

Vision LLMs as OCR

Challenges

相关概念