
Vision

Multimodal Vision, Image Understanding
The ability of a language model to understand and reason about images in addition to text. You send a photo and ask "what's in this picture?", or upload a chart and ask it to "summarize the trend". Vision-capable models (Claude, GPT-4V, Gemini) encode images into tokens that the language model can process alongside text tokens, enabling unified text-image reasoning.
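
As a concrete illustration, here is a minimal sketch of sending an image plus a question to a vision-capable model, assuming the Anthropic Python SDK; the model name and file name are placeholders, and other providers (GPT-4V, Gemini) expose similar image-plus-text message formats.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode a local chart image as base64 so it can travel in the request body.
with open("chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder: any vision-capable model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Summarize the trend in this chart."},
        ],
    }],
)
print(message.content[0].text)
```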

Why It Matters

Vision changes what an LLM can do. Instead of describing a bug in text, you send a screenshot. Instead of typing out a table, you photograph it. Instead of explaining a chart, you share it. Vision makes AI usable for tasks where text alone falls short, which covers most real-world tasks. For everyday users, it is the most impactful multimodal capability.

Deep Dive

The typical architecture: images are processed by a vision encoder (usually a Vision Transformer or CLIP variant) that converts image pixels into a sequence of visual tokens. These tokens are projected into the same embedding space as text tokens and concatenated with the text input. The language model then processes both visual and text tokens together through its standard attention layers, enabling cross-modal reasoning.
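
The following is a toy sketch of that pipeline in PyTorch, not any specific model's implementation: a stand-in patch encoder produces visual tokens, a linear layer projects them into the text embedding space, and the concatenated sequence runs through a shared Transformer so attention mixes both modalities. All dimensions and layer counts are illustrative.

```python
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    """Illustrative only: patch encoder -> projection -> concat with text tokens."""

    def __init__(self, d_text=512, d_vision=768, vocab_size=32000, patch=16):
        super().__init__()
        # "Vision encoder": a bare patch embedding here; real models use a ViT/CLIP encoder.
        self.patch_embed = nn.Conv2d(3, d_vision, kernel_size=patch, stride=patch)
        # Projection that maps visual tokens into the text embedding space.
        self.proj = nn.Linear(d_vision, d_text)
        self.tok_embed = nn.Embedding(vocab_size, d_text)
        # Stand-in for the language model's attention stack.
        layer = nn.TransformerEncoderLayer(d_model=d_text, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, text_ids):
        v = self.patch_embed(image)            # (B, d_vision, H/p, W/p)
        v = v.flatten(2).transpose(1, 2)       # (B, num_patches, d_vision)
        v = self.proj(v)                       # visual tokens now in text space
        t = self.tok_embed(text_ids)           # (B, seq_len, d_text)
        x = torch.cat([v, t], dim=1)           # one joint sequence
        return self.lm(x)                      # attention runs over both modalities

model = ToyVisionLanguageModel()
out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(out.shape)  # (1, 196 + 12, 512): 196 visual tokens followed by 12 text tokens
```

The key step is the projection layer: once visual tokens share the text embedding space, the language model needs no special machinery to reason across modalities.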

What Models Can (and Can't) See

Current vision models excel at: describing image content, reading text in images (OCR), understanding charts and diagrams, identifying objects and people (when appropriate), and reasoning about spatial relationships. They struggle with: precise counting (especially in cluttered scenes), fine-grained spatial reasoning ("is A above or below B?"), reading small or stylized text, and understanding images that require domain expertise (medical scans, specialized equipment).

Resolution and Cost

Higher resolution images produce more visual tokens, consuming more context window and costing more. Most providers automatically resize or tile images to balance quality and cost. A typical image might produce 500–2000 tokens. Understanding this helps you optimize: don't send a 4K screenshot when a 1080p crop of the relevant area would work better and cost less.
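
As an illustration of that trade-off, here is a small Python sketch that downscales an image before sending it and estimates its token cost. The 1568-pixel limit and the roughly 750-pixels-per-token heuristic are assumptions modeled on one provider's published guidance; check your provider's documentation for the actual numbers.

```python
from PIL import Image

def estimate_tokens(width: int, height: int) -> int:
    # Rough heuristic: some providers document roughly one visual token per ~750 pixels.
    # Treat this as an approximation only, not a billing formula.
    return (width * height) // 750

def prepare_image(path: str, max_side: int = 1568) -> Image.Image:
    """Downscale so the longest side stays under max_side (an assumed provider limit)."""
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    return img

img = prepare_image("screenshot.png")
print(img.size, "~", estimate_tokens(*img.size), "tokens")
```

Cropping to the relevant region before resizing usually saves more tokens than resizing alone, since the model never has to spend tokens on irrelevant pixels.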
