Vision Transformer: Definition & Meaning — AI Wiki

A Transformer architecture applied to images by splitting an image into fixed-size patches (e.g., 16×16 pixels), treating each patch as a "token," and processing the sequence of patches with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that Transformers could match or exceed CNNs on image tasks when trained on enough data, unifying the architectures for language and vision.

Why it matters

ViT proved that the Transformer is a universal architecture — not just for text but for images too. This unification enabled the explosion of multimodal models: if images and text are both sequences of tokens processed by the same architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and the foundation of modern computer vision.

Deep Dive

The process: (1) split a 224×224 image into 196 patches of 16×16 pixels, (2) flatten each patch into a vector and project it through a linear layer to create patch embeddings, (3) add positional embeddings so the model knows where each patch is, (4) prepend a [CLS] token whose final representation is used for classification, (5) process through standard Transformer encoder layers. The output is a sequence of patch representations that can be used for classification, detection, or as features for other models.

ViT vs. CNN

CNNs have built-in inductive biases: locality (nearby pixels are related) and translation equivariance (patterns are recognized regardless of position). ViT has neither — it treats patches as an unordered set (position comes from learned embeddings) and attends to all patches equally. This makes ViT less data-efficient than CNNs for small datasets but more powerful for large datasets, where it can learn these biases from data rather than having them hard-coded.

Beyond Classification

ViT spawned a family of vision Transformers: DeiT (data-efficient training), Swin Transformer (hierarchical vision with shifted windows), MAE (masked autoencoder for self-supervised vision), and DINO/DINOv2 (self-supervised visual representations). These models now dominate vision tasks: image classification, object detection, segmentation, and feature extraction. The ViT architecture is also the image encoder in most multimodal models (LLaVA, GPT-4V).

Vision Transformer

Why it matters

Deep Dive

ViT vs. CNN

Beyond Classification

Related Concepts