Using AI

Vision

Multimodal Vision, Image Understanding
The ability of a language model to understand and reason about images alongside text. You send a photo and ask "What is in this image?", or upload a chart and ask "Summarize the trends." Vision-capable models (Claude, GPT-4V, Gemini) encode images into tokens that the language model processes together with text tokens, enabling unified text-and-image reasoning.

Why It Matters

Vision changes what LLMs can do. Instead of describing a bug in words, you take a screenshot of it. Instead of typing out a table, you take a photo of it. Instead of explaining a diagram, you share it. Vision makes AI accessible for tasks where text alone is insufficient, which is most real-world tasks. For everyday users, it is the most impactful multimodal capability.

Deep Dive

The typical architecture: images are processed by a vision encoder (usually a Vision Transformer or CLIP variant) that converts image pixels into a sequence of visual tokens. These tokens are projected into the same embedding space as text tokens and concatenated with the text input. The language model then processes both visual and text tokens together through its standard attention layers, enabling cross-modal reasoning.
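The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration only: a single random linear map stands in for the vision encoder and projection (a real encoder runs many transformer layers), and the names `PATCH`, `D_MODEL`, `patchify`, and `build_input_sequence` are hypothetical, not taken from any particular model.

```python
import random

PATCH = 4      # hypothetical patch size in pixels
D_MODEL = 8    # hypothetical embedding width shared by text and image tokens

def patchify(image, patch=PATCH):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - h % patch, patch):
        for left in range(0, w - w % patch, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def linear(vec, weights):
    """Multiply a vector by a (len(vec) x d_out) weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weights))
            for j in range(len(weights[0]))]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def build_input_sequence(image, text_token_ids, rng):
    # Stand-in for vision encoder + projection: one linear map per patch.
    w_proj = random_matrix(PATCH * PATCH, D_MODEL, rng)
    visual_tokens = [linear(p, w_proj) for p in patchify(image)]
    # Stand-in for the text embedding table (toy vocabulary of 100 ids).
    embed_table = random_matrix(100, D_MODEL, rng)
    text_tokens = [embed_table[i] for i in text_token_ids]
    # Both token kinds share one width, so they form a single sequence
    # that the language model's attention layers can process together.
    return visual_tokens + text_tokens

rng = random.Random(0)
image = [[rng.random() for _ in range(16)] for _ in range(8)]  # 8 x 16 pixels
sequence = build_input_sequence(image, [5, 17, 42], rng)
print(len(sequence), len(sequence[0]))  # 8 visual + 3 text tokens, width 8
```

The point of the sketch is the last line of `build_input_sequence`: once image patches are projected to the same width as text embeddings, the model sees one sequence and needs no separate image pathway.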

What Models Can (and Can't) See

Current vision models excel at: describing image content, reading text in images (OCR), understanding charts and diagrams, identifying objects and people (when appropriate), and reasoning about broad spatial relationships. They struggle with: precise counting (especially in cluttered scenes), fine-grained spatial reasoning ("is A above or below B?"), reading small or stylized text, and understanding images that require domain expertise (medical scans, specialized equipment).

Resolution and Cost

Higher-resolution images produce more visual tokens, which consume more of the context window and cost more. Most providers automatically resize or tile images to balance quality and cost. A typical image might produce 500–2000 tokens. Understanding this helps you optimize: don't send a 4K screenshot when a 1080p crop of the relevant area would work better and cost less.
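To make the trade-off concrete, here is a rough estimator. It is a sketch under assumed numbers, not any provider's actual formula: it assumes the provider downscales the longest side to a hypothetical `max_side` of 1024 pixels and charges one token per 32 x 32 pixel tile. Check your provider's documentation for the real rules.

```python
import math

def estimate_visual_tokens(width, height, tile=32, max_side=1024):
    """Estimate visual tokens for an image.

    Assumptions (illustrative, not a real provider's rules):
    - the longest side is downscaled to at most `max_side` pixels
    - each `tile` x `tile` pixel region becomes one token
    """
    scale = min(1.0, max_side / max(width, height))
    w, h = round(width * scale), round(height * scale)
    return math.ceil(w / tile) * math.ceil(h / tile)

print(estimate_visual_tokens(3840, 2160))  # 4K screenshot  -> 576
print(estimate_visual_tokens(1920, 1080))  # 1080p          -> 576
print(estimate_visual_tokens(800, 600))    # cropped region -> 475
```

Under these assumptions the 4K and 1080p screenshots cost the same after downscaling, so the extra pixels buy no extra detail. The 800 x 600 crop is not downscaled at all, so the relevant area reaches the model at full resolution for fewer tokens.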
