Zubnet AI Learning Wiki › Multimodal
Basics

Multimodal

A model that can understand and/or generate multiple types of data: text, images, audio, video, code. Claude can read both images and text; some models can also produce images or speech. "Multimodal" contrasts with "unimodal" — the latter handles only a single type.

Why It Matters

Real-world tasks are inherently multimodal. You want to show an AI a screenshot and ask "what's wrong here?", or hand it a diagram and say "implement this." Multimodal models make that possible.

Deep Dive

Multimodal AI works by encoding different types of data — text, images, audio, video — into a shared representation space where the model can reason across them. The most common approach uses separate encoder networks for each modality (a vision encoder for images, an audio encoder for speech) that transform raw inputs into sequences of embeddings, which then get fed into a shared Transformer backbone alongside text tokens. This is how models like GPT-4o and Claude handle images: a vision encoder (often a variant of a Vision Transformer, or ViT) converts the image into a grid of "visual tokens" that the language model processes just like text tokens.
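The encoder-to-shared-space pipeline above can be sketched in a few lines of NumPy. This is a toy illustration, not any specific model: the patch size, embedding width, and the random projection standing in for a trained vision encoder are all assumptions.

```python
import numpy as np

PATCH = 16        # ViT-style patch size in pixels (illustrative)
D_MODEL = 512     # assumed shared embedding width of the language model

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, 3) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly"
    rows = image.reshape(h // patch, patch, w // patch, patch, c)
    return rows.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # dummy 224x224 RGB image
patches = patchify(image)                    # (196, 768)

# A learned linear projection maps each patch into the LM's embedding
# space; here it is random, standing in for the trained vision encoder.
W_proj = rng.normal(size=(patches.shape[1], D_MODEL))
visual_tokens = patches @ W_proj             # (196, 512)

# Text embeddings live in the same space, so the Transformer backbone
# can attend over one concatenated sequence of visual and text tokens.
text_tokens = rng.normal(size=(10, D_MODEL))
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (206, 512)
```

The key point the sketch makes concrete: once projected, a 224x224 image is just 196 extra positions in the same sequence the language model already processes.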

Understanding vs Generation

There is an important distinction between multimodal understanding and multimodal generation. Most current chat models are multimodal on the input side — they can read images, PDFs, and sometimes audio — but their output is still primarily text. True multimodal generation, where the same model can produce images, audio, and text natively, is a harder problem. Google's Gemini and OpenAI's GPT-4o push in this direction, but many "multimodal" products actually chain separate specialized models behind the scenes: a language model decides what image to create, then hands a text prompt to a diffusion model like DALL-E or Imagen to actually generate it. The seam between these models matters for quality and coherence.
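The chained pattern described above — an LM plans the image, a separate diffusion model renders it — can be sketched with stub functions. Both model calls here are hypothetical placeholders, not real APIs:

```python
def language_model(prompt: str) -> str:
    """Stub for the LM that turns a user request into an image prompt."""
    return f"A detailed illustration of: {prompt}"

def diffusion_model(image_prompt: str) -> dict:
    """Stub for a text-to-image model such as DALL-E or Imagen."""
    return {"format": "png", "prompt_used": image_prompt}

def generate_image(user_request: str) -> dict:
    # Step 1: the LM decides *what* image to create (text in, text out).
    image_prompt = language_model(user_request)
    # Step 2: the prompt crosses the seam to a specialist image model.
    # Any intent lost in this hand-off is invisible to the LM, which is
    # why the seam affects quality and coherence.
    return diffusion_model(image_prompt)

result = generate_image("a cat reading a newspaper")
```

The seam is the single string passed between the two stubs: the diffusion model sees only that prompt, never the original conversation.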

How Fast It Moved

The evolution here has been rapid. In 2022, getting an AI to reliably describe what was in an image was impressive. By 2024, models could read handwritten notes, interpret complex charts, understand UI screenshots, and follow visual instructions. The practical implications are enormous. Developers use multimodal models to build document processing pipelines that handle scanned PDFs, photos of whiteboards, or mixed text-and-diagram technical specs — all without separate OCR or image classification steps. In Claude's case, you can paste a screenshot of an error message, a photo of a hand-drawn wireframe, or a complex data visualization, and the model reasons about it in context alongside your text instructions.
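In practice, "paste a screenshot alongside text" means sending the image as a content block in the same message as your question. The sketch below builds such a request body; the block shape follows the convention of Anthropic's Messages API, but the model name is a placeholder and the exact fields should be checked against current documentation. No network call is made.

```python
import base64

fake_image = b"pretend-these-are-png-bytes"  # placeholder image data
image_b64 = base64.b64encode(fake_image).decode("ascii")

# Image and text travel together in one user turn, so the model reasons
# about the screenshot in the context of the question.
request_body = {
    "model": "example-vision-model",  # placeholder model name
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {
                    "type": "text",
                    "text": "What is wrong in this error screenshot?",
                },
            ],
        }
    ],
}
```

Document-processing pipelines follow the same shape: each scanned page or photo becomes one image block, interleaved with text instructions, with no separate OCR step.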

Where Vision Falls Short

One nuance that trips people up: "multimodal" does not mean "equally good at all modalities." Most multimodal LLMs are still fundamentally language models with vision bolted on. Their text reasoning is typically much stronger than their visual understanding. They might miscount objects in an image, struggle with spatial relationships, or fail to read small text in a screenshot — tasks that feel trivially easy to a human. The vision encoder's resolution matters too: if your image gets downscaled before the model sees it, fine details are lost no matter how smart the language model is. When building production systems, it pays to understand what resolution and token budget your model allocates to images, because that directly affects what visual details it can and cannot perceive.
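A back-of-the-envelope check like the one below makes the resolution/token trade-off concrete. The two constants — a long-edge cap before the model sees the image, and a pixels-per-token cost — are illustrative assumptions; substitute whatever your provider actually documents.

```python
PIXELS_PER_TOKEN = 750   # assumed cost: one token per ~750 pixels
MAX_EDGE = 1568          # assumed long-edge cap applied before encoding

def effective_size(width: int, height: int) -> tuple[int, int]:
    """Image size after provider-side downscaling to the long-edge cap."""
    long_edge = max(width, height)
    if long_edge <= MAX_EDGE:
        return width, height
    scale = MAX_EDGE / long_edge
    return round(width * scale), round(height * scale)

def image_tokens(width: int, height: int) -> int:
    """Approximate token budget the image consumes after downscaling."""
    w, h = effective_size(width, height)
    return (w * h) // PIXELS_PER_TOKEN

# A 4K screenshot is downscaled ~2.4x under these assumptions, so 8-px
# UI text lands around 3 px — which is why small text becomes unreadable
# to the model no matter how strong its language reasoning is.
print(effective_size(3840, 2160), image_tokens(3840, 2160))
```

Running this kind of estimate before shipping tells you whether to crop regions of interest out of large screenshots instead of sending them whole.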

The Any-to-Any Frontier

The frontier is moving toward what researchers call "any-to-any" models — systems that can take any combination of modalities as input and produce any combination as output. Think: upload a video, get a text summary with relevant still frames pulled out, plus an audio narration. Or describe a scene in text and get a video with synchronized music. We are not fully there yet, but the trajectory is clear. The models that will matter most in the next few years are the ones that dissolve the boundaries between seeing, hearing, reading, writing, and creating, making the modality of your input and output a choice rather than a constraint.
