OCR: Definition & Meaning — AI Wiki

Images से text extract करना — documents की photos, screenshots, signs, handwritten notes, या text वाली कोई भी image। Modern OCR text detection (image में text कहाँ appear होता है ये ढूँढना) और text recognition (text क्या कहता है ये पढ़ना) combine करता है। Deep learning OCR curved text, multiple languages, varied fonts, और poor image quality को बहुत बेहतर handle करता है पुरानी rule-based approaches से।

यह क्यों matter करता है

OCR physical world को digitize करता है। Expense tracking के लिए receipts scan करना, archival के लिए documents पढ़ना, forms से data extract करना, signs को real-time में translate करना, और image-based PDFs को searchable बनाना — सब OCR पर depend करते हैं। LLMs के साथ combine करने पर, OCR sophisticated document understanding enable करता है — सिर्फ text पढ़ना नहीं बल्कि invoices, contracts, और reports समझना।

Deep Dive

Modern OCR pipelines have two stages: detection (finding text regions using models like CRAFT or DBNet) and recognition (reading text in each region using CRNN or Transformer-based models). End-to-end approaches (like PaddleOCR, EasyOCR) combine both stages. For structured documents, specialized models (LayoutLM, Donut) understand both text content and spatial layout, recognizing that "Total: $42.50" on an invoice means something different from the same text in a paragraph.

Vision LLMs as OCR

Multimodal LLMs (Claude, GPT-4V, Gemini) have become remarkably good at OCR as a side effect of their vision capabilities. You can upload an image and ask "read all text in this image" or "extract the table from this receipt." For complex documents with mixed layouts, handwriting, and multiple languages, vision LLMs often outperform dedicated OCR systems because they understand context and can handle ambiguity. The trade-off is speed and cost — dedicated OCR is 100x faster for bulk processing.

Challenges

Remaining hard problems: handwriting recognition (especially cursive or messy handwriting), degraded historical documents, text in complex backgrounds (wild text on signs, clothing, products), and scripts with complex character compositions (Chinese, Arabic, Devanagari). Accuracy varies significantly by language and script — Latin script OCR is nearly solved, but CJK and right-to-left scripts still have meaningful error rates.

OCR

यह क्यों matter करता है

Deep Dive

Vision LLMs as OCR

Challenges

संबंधित अवधारणाएँ