
Vision Transformer

ViT
The Transformer architecture applied to images: cut an image into fixed-size patches (e.g., 16×16 pixels), treat each patch as a "token", and process the patch sequence with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that, given enough training data, Transformers can match or surpass CNNs on image tasks, unifying the architectures of language and vision.

Why It Matters

ViT proved that the Transformer is a general-purpose architecture, not just for text but for images as well. This unification enabled the explosion of multimodal models: if images and text are both token sequences processed by the same architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and a foundation of modern computer vision.

Deep Dive

The process: (1) split a 224×224 image into 196 patches of 16×16 pixels, (2) flatten each patch into a vector and project it through a linear layer to create patch embeddings, (3) add positional embeddings so the model knows where each patch is, (4) prepend a [CLS] token whose final representation is used for classification, (5) process through standard Transformer encoder layers. The output is a sequence of patch representations that can be used for classification, detection, or as features for other models.
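A minimal PyTorch sketch of this pipeline (hypothetical class and parameter names, not the reference implementation; the stride-16 convolution is the standard equivalent of flattening each patch and projecting it with a linear layer):

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Sketch of a ViT encoder: patchify -> project -> +position -> [CLS] -> Transformer."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # 14 * 14 = 196
        # A conv with kernel = stride = patch_size projects each 16x16 patch to a dim-d vector
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        x = self.patch_embed(images)              # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (B, 196, dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)            # prepend [CLS] -> (B, 197, dim)
        x = x + self.pos_embed                    # learned positional embeddings
        x = self.encoder(x)                       # standard Transformer encoder layers
        return self.head(x[:, 0])                 # classify from the final [CLS] representation

x = torch.randn(2, 3, 224, 224)
logits = MinimalViT(depth=2)(x)                   # shallow depth only to keep the sketch fast
print(logits.shape)                               # torch.Size([2, 1000])
```

Returning the full encoder output instead of just the [CLS] token gives the per-patch features used for detection, segmentation, or as inputs to other models.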

ViT vs. CNN

CNNs have built-in inductive biases: locality (nearby pixels are related) and translation equivariance (patterns are recognized regardless of position). ViT has neither — it treats patches as an unordered set (position comes from learned embeddings) and attends to all patches equally. This makes ViT less data-efficient than CNNs for small datasets but more powerful for large datasets, where it can learn these biases from data rather than having them hard-coded.
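The "unordered set" point can be made concrete: without positional embeddings, a Transformer encoder's pooled output does not change when the patch sequence is shuffled. A small self-contained demonstration with toy dimensions (not ViT itself):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

patches = torch.randn(1, 196, 64)        # a fake sequence of patch embeddings, no position added
perm = torch.randperm(196)

with torch.no_grad():
    pooled = encoder(patches).mean(dim=1)                 # mean-pool over tokens
    pooled_shuffled = encoder(patches[:, perm]).mean(dim=1)

# Self-attention has no built-in notion of patch order, so shuffling the patches
# leaves the pooled representation unchanged (up to floating-point noise).
print(torch.allclose(pooled, pooled_shuffled, atol=1e-5))   # True
```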

Beyond Classification

ViT spawned a family of vision Transformers: DeiT (data-efficient training), Swin Transformer (hierarchical vision with shifted windows), MAE (masked autoencoder for self-supervised vision), and DINO/DINOv2 (self-supervised visual representations). These models now dominate vision tasks: image classification, object detection, segmentation, and feature extraction. The ViT architecture is also the image encoder in most multimodal models (LLaVA, GPT-4V).
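As a usage sketch, a pretrained ViT can be used purely as a feature extractor, for example via the timm library (assumes timm is installed; pretrained=True downloads weights):

```python
import timm
import torch

# num_classes=0 drops the classification head, so the forward pass returns pooled features.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).eval()

images = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(images)                 # pooled image embedding, (1, 768) for ViT-Base
    tokens = model.forward_features(images)  # per-token features; (1, 197, 768) in recent timm versions
print(features.shape, tokens.shape)
```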
