Vision Transformer

ViT
The Transformer architecture applied to images: split an image into fixed-size patches (for example 16×16 pixels), treat each patch as a "token", and process the resulting patch sequence with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that, given enough training data, Transformers can match or surpass CNNs on image tasks, unifying the architectures of language and vision.
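
As a rough sketch of the patch-to-token step, the snippet below (assuming PyTorch, one 224×224 RGB image, and 16×16 patches, the sizes used by ViT-Base; variable names are illustrative) reshapes an image into the sequence of flattened patch vectors the Transformer will see.

    import torch

    # Illustrative sketch: turn one 224x224 RGB image into a sequence of
    # flattened 16x16 patches (the "tokens" fed to the Transformer).
    image = torch.randn(3, 224, 224)                 # (channels, height, width)
    patch = 16

    # Cut into non-overlapping 16x16 patches: shape (3, 14, 14, 16, 16)
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    # Flatten each patch into one vector: (14*14, 3*16*16) = (196, 768)
    tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)

    print(tokens.shape)                              # torch.Size([196, 768])

A learned linear layer then projects each flattened patch to the model's embedding dimension, producing the patch embeddings described in the Deep Dive section below.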

Why It Matters

ViT demonstrated that the Transformer is a general-purpose architecture, for images as well as text. That unification fueled the explosion of multimodal models: if images and text are both token sequences handled by the same kind of architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and a foundation of modern computer vision.

Deep Dive

The process: (1) split a 224×224 image into 196 patches of 16×16 pixels, (2) flatten each patch into a vector and project it through a linear layer to create patch embeddings, (3) add positional embeddings so the model knows where each patch is, (4) prepend a [CLS] token whose final representation is used for classification, (5) process through standard Transformer encoder layers. The output is a sequence of patch representations that can be used for classification, detection, or as features for other models.
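A compact PyTorch sketch of these five steps is shown below. The hyperparameters (768-dimensional embeddings, 12 layers, 12 heads) match ViT-Base, but the class is simplified and illustrative rather than a reproduction of the reference implementation; the name MiniViT and its arguments are made up for this example.

    import torch
    import torch.nn as nn

    class MiniViT(nn.Module):
        def __init__(self, image_size=224, patch=16, in_chans=3,
                     embed_dim=768, depth=12, heads=12, num_classes=1000):
            super().__init__()
            num_patches = (image_size // patch) ** 2          # 196 for 224 / 16
            # (1)+(2) split into patches and project linearly; a strided conv with
            # kernel == stride == patch size does both in one operation.
            self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                         kernel_size=patch, stride=patch)
            # (4) learnable [CLS] token and (3) learnable positional embeddings
            self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
            self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
            # (5) standard Transformer encoder layers
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                               dim_feedforward=4 * embed_dim,
                                               batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):                                  # x: (B, 3, 224, 224)
            x = self.patch_embed(x).flatten(2).transpose(1, 2) # (B, 196, 768)
            cls = self.cls_token.expand(x.size(0), -1, -1)     # (B, 1, 768)
            x = torch.cat([cls, x], dim=1) + self.pos_embed    # add position info
            x = self.encoder(x)                                # (B, 197, 768)
            return self.head(x[:, 0])                          # classify from [CLS]

    logits = MiniViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                        # torch.Size([2, 1000])

Dropping the classification head and keeping the full 197-token output gives the patch representations used for detection, segmentation, or as features for other models.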

ViT vs. CNN

CNNs have built-in inductive biases: locality (nearby pixels are related) and translation equivariance (patterns are recognized regardless of position). ViT has neither: it treats patches as an unordered set (position comes only from learned embeddings), and every patch can attend to every other patch from the first layer. This makes ViT less data-efficient than CNNs on small datasets but more powerful on large ones, where it can learn these biases from data rather than having them hard-coded.
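
To make the "unordered set" point concrete, the check below (reusing the MiniViT sketch from the Deep Dive section, with its default 196 patches; names like perm are illustrative) shuffles the patch tokens together with their positional-embedding rows and verifies that the [CLS] output is unchanged up to floating-point error, showing that spatial order carries no information beyond what the learned embeddings encode.

    import torch

    torch.manual_seed(0)
    model = MiniViT().eval()               # eval() disables dropout in the encoder
    x = torch.randn(1, 3, 224, 224)

    with torch.no_grad():
        reference = model(x)

        # Recompute the patch tokens, then permute them ...
        perm = torch.randperm(196)
        patches = model.patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, 768)
        shuffled = patches[:, perm]
        # ... and permute the matching positional-embedding rows ([CLS] row stays put).
        pos = model.pos_embed.clone()
        pos[:, 1:] = model.pos_embed[:, 1 + perm]

        cls = model.cls_token.expand(1, -1, -1)
        tokens = torch.cat([cls, shuffled], dim=1) + pos
        permuted = model.head(model.encoder(tokens)[:, 0])

    print(torch.allclose(reference, permuted, atol=1e-4))           # True

An analogous shuffle of input regions would change a CNN's output, because convolution kernels only mix spatially adjacent pixels.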

Beyond Classification

ViT spawned a family of vision Transformers: DeiT (data-efficient training), Swin Transformer (hierarchical vision with shifted windows), MAE (masked autoencoder for self-supervised vision), and DINO/DINOv2 (self-supervised visual representations). These models now dominate vision tasks: image classification, object detection, segmentation, and feature extraction. The ViT architecture is also the image encoder in most multimodal models (LLaVA, GPT-4V).
