Vision

Multimodal Vision, Image Understanding
The ability of a language model to understand and reason over images, not just text. You send a photo and ask "what's in this picture?", or upload a chart and ask "summarize the trend." Vision-capable models (Claude, GPT-4V, Gemini) encode images into tokens that the language model can process alongside text tokens, enabling unified text-image reasoning.

Why It Matters

Vision changes what an LLM can do. Instead of describing a bug in words, you send a screenshot. Instead of typing out a table, you photograph it. Instead of explaining a chart, you share it. Vision makes AI usable for tasks where words alone fall short, and most real-world tasks are like that. For everyday users, this is the most impactful multimodal capability.

Deep Dive

The typical architecture: images are processed by a vision encoder (usually a Vision Transformer or CLIP variant) that converts image pixels into a sequence of visual tokens. These tokens are projected into the same embedding space as text tokens and concatenated with the text input. The language model then processes both visual and text tokens together through its standard attention layers, enabling cross-modal reasoning.
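A minimal PyTorch sketch of that pipeline follows. Everything here is illustrative rather than any specific model's architecture: the linear layer stands in for a full ViT/CLIP encoder, the small transformer stands in for the language model, and all dimensions are made up.

```python
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=768, text_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a vision encoder; a real model uses a ViT/CLIP tower
        # that turns image patches into a sequence of visual features.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Projection maps visual features into the LM's embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        # Stand-in for the language model (a full transformer decoder in practice).
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, patch_features, text_ids):
        # patch_features: (batch, num_patches, vision_dim) from the image
        visual_tokens = self.projector(self.vision_encoder(patch_features))
        text_tokens = self.text_embed(text_ids)
        # Concatenate visual and text tokens into one sequence; the attention
        # layers then mix both modalities freely (cross-modal reasoning).
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm(sequence)

model = ToyVisionLanguageModel()
image = torch.randn(1, 256, 768)           # e.g. a 16x16 grid of patch features
prompt = torch.randint(0, 32000, (1, 12))  # tokenized text prompt
out = model(image, prompt)                 # shape: (1, 256 + 12, 512)
```

Note how the image contributes 256 tokens against the prompt's 12: visual tokens typically dominate the sequence, which is why image resolution drives cost (see below).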

What Models Can (and Can't) See

Current vision models excel at: describing image content, reading text in images (OCR), understanding charts and diagrams, identifying objects and people (when appropriate), and reasoning about spatial relationships. They struggle with: precise counting (especially in cluttered scenes), fine-grained spatial reasoning ("is A above or below B?"), reading small or stylized text, and understanding images that require domain expertise (medical scans, specialized equipment).

Resolution and Cost

Higher resolution images produce more visual tokens, consuming more context window and costing more. Most providers automatically resize or tile images to balance quality and cost. A typical image might produce 500–2000 tokens. Understanding this helps you optimize: don't send a 4K screenshot when a 1080p crop of the relevant area would work better and cost less.
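As a rough illustration, the sketch below downscales an image before upload and prints a token estimate. The 1024-pixel cap is an arbitrary choice, and the tokens ≈ width × height / 750 heuristic is one provider's published approximation (Anthropic's, for Claude); other providers tile and count differently.

```python
from PIL import Image

def prepare_image(path, max_side=1024):
    """Downscale an image so its longest side is at most max_side pixels."""
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1:
        # Resize only when the image exceeds the cap; crop to the relevant
        # region first if only part of the screenshot matters.
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return img

img = prepare_image("screenshot.png")
# Rough cost intuition (assumption: ~width * height / 750 tokens).
print(img.size, round(img.width * img.height / 750), "tokens (rough estimate)")
```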
