Vision Transformer: Definition & Meaning — AI Wiki

A Transformer architecture applied to images: the image is split into fixed-size patches (e.g., 16×16 pixels), each patch is treated as a "token", and the resulting sequence of patches is processed with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that Transformers can match or exceed CNNs on image tasks when trained on enough data, unifying the architectures used for language and vision.

Why It Matters

ViT showed that the Transformer is a general-purpose architecture, suited not only to text but also to images. That unification enabled the explosion of multimodal models: if images and text are both sequences of tokens processed by the same architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and the foundation of modern computer vision.

Deep Dive

The process, for a 224×224 image with 16×16 patches: (1) split the image into a 14×14 grid of 196 patches, (2) flatten each patch into a vector (16×16×3 = 768 values for an RGB image) and project it through a linear layer to create patch embeddings, (3) prepend a learnable [CLS] token whose final representation is used for classification, (4) add positional embeddings so the model knows where each patch is, (5) process the sequence through standard Transformer encoder layers. The output is a sequence of patch representations that can be used for classification, detection, or as features for other models.
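The input side of this pipeline can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the weights are random stand-ins for learned parameters, `d_model = 64` is an arbitrary small width chosen for the example, and the function names `patchify` and `embed_patches` are hypothetical. It prepends the [CLS] token before adding positional embeddings, so the [CLS] slot receives a position too, as in the original ViT. The Transformer encoder layers themselves are omitted.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (H, W, C) -> (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh*gw, p*p*C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, w_proj, cls_token, pos_embed, patch_size=16):
    """Patchify, project linearly, prepend [CLS], add positional embeddings."""
    patches = patchify(image, patch_size)       # (196, 768) for 224x224x3
    tokens = patches @ w_proj                   # (196, d_model)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (197, d_model)
    return tokens + pos_embed                   # (197, d_model)

# Random stand-ins for parameters that would normally be learned.
rng = np.random.default_rng(0)
d_model = 64                                    # assumed width, for illustration only
image = rng.random((224, 224, 3))
w_proj = rng.normal(size=(16 * 16 * 3, d_model)) * 0.02
cls_token = np.zeros(d_model)
pos_embed = rng.normal(size=(197, d_model)) * 0.02

tokens = embed_patches(image, w_proj, cls_token, pos_embed)
print(tokens.shape)  # (197, 64)
```

The sequence length of 197 is the 196 patches plus the single [CLS] token; this is the sequence a standard Transformer encoder would then process.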

ViT vs. CNN

CNNs have built-in inductive biases: locality (nearby pixels are related) and translation equivariance (a pattern produces the same response wherever it appears). ViT has neither. Self-attention treats patches as an unordered set, so spatial information comes only from learned positional embeddings, and every patch can attend to every other patch from the first layer, with no preference for neighbors. This makes ViT less data-efficient than CNNs on small datasets but more powerful on large ones, where it can learn these biases from data rather than having them hard-coded.
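The "unordered set" point can be checked numerically. The sketch below uses a single-head self-attention layer with random weights (an assumption made for illustration, not a trained ViT layer): shuffling the input patches simply shuffles the outputs the same way, so the layer cannot tell where a patch came from. Once a positional embedding is added by slot, the shuffled input produces genuinely different outputs.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (n, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 196, 32                                   # 196 patches, assumed width 32
x = rng.normal(size=(n, d))                      # stand-in patch embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
perm = rng.permutation(n)                        # a random shuffle of the patches

# Without positions: permuting the input only permutes the output rows.
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)
order_blind = np.allclose(out[perm], out_shuffled)

# With positions added by slot: the same shuffle now changes the result.
pos = rng.normal(size=(n, d))                    # stand-in positional embeddings
out_pos = self_attention(x + pos, w_q, w_k, w_v)
out_pos_shuffled = self_attention(x[perm] + pos, w_q, w_k, w_v)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print(order_blind, order_aware)
```

A convolution, by contrast, only ever mixes neighboring pixels, which is the locality bias a ViT has to learn from data.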

Beyond Classification

ViT spawned a family of vision Transformers: DeiT (data-efficient training), Swin Transformer (hierarchical vision with shifted windows), MAE (masked autoencoder for self-supervised vision), and DINO/DINOv2 (self-supervised visual representations). Models in this family now lead many vision tasks: image classification, object detection, segmentation, and feature extraction. A ViT is also the image encoder in many multimodal models, such as LLaVA, which builds on a CLIP ViT; the same is commonly said of GPT-4V, though its architecture has not been published.
