Vision Transformer: Definition & Meaning — AI Wiki

A Transformer architecture applied to images: it splits an image into fixed-size patches (e.g. 16×16 pixels), treats each patch as a "token", and processes the resulting sequence of patches with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that Transformers can match or outperform CNNs on image tasks when trained on enough data, unifying the architectures used for language and vision.

Why It Matters

ViT showed that the Transformer is a general-purpose architecture, suited not only to text but also to images. This unification enabled the explosion of multimodal models: if images and text are both sequences of tokens processed by the same architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and the foundation of modern computer vision.

Deep Dive

The process: (1) split a 224×224 image into a 14×14 grid of 196 patches, each 16×16 pixels, (2) flatten each patch into a vector and project it through a linear layer to create patch embeddings, (3) prepend a learnable [CLS] token whose final representation is used for classification, (4) add positional embeddings so the model knows where each patch is, (5) process the sequence through standard Transformer encoder layers. The output is a sequence of patch representations that can be used for classification, detection, or as features for other models.
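
The input side of this pipeline (everything before the encoder layers) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name `patchify`, the embedding width of 768, and the randomly initialized weights are assumptions made for the example, and in a real model the projection, [CLS] token, and positional embeddings are all learned. The [CLS] token is prepended before the positional embeddings are added, so it receives a position embedding too, as in the original ViT paper.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a real RGB image
patch_size, embed_dim = 16, 768        # embed_dim is an assumed width

# Steps 1-2: split into patches, flatten, and project linearly.
patches = patchify(image, patch_size)                      # 196 patches
w_proj = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
patch_embeddings = patches @ w_proj

# Prepend a [CLS] token, then add one positional embedding per token.
# Random/zero values stand in for learned parameters.
cls_token = np.zeros((1, embed_dim))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0] + 1, embed_dim))
tokens = np.concatenate([cls_token, patch_embeddings], axis=0) + pos_embed

print(patches.shape, tokens.shape)     # (196, 768) (197, 768)
```

The resulting `tokens` array, one row per token, is what the Transformer encoder layers would consume. Each flattened patch holds 16 × 16 × 3 = 768 values, which only coincidentally equals the assumed embedding width here.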

ViT vs. CNN

CNNs have built-in inductive biases: locality (nearby pixels are related) and translation equivariance (a pattern produces the same response wherever it appears). ViT has neither. Self-attention treats the patches as an unordered set, so spatial information comes only from the learned positional embeddings, and every patch can attend to every other patch from the first layer rather than only to its neighbors. This makes ViT less data-efficient than CNNs on small datasets but more powerful on large ones, where it can learn these biases from data rather than having them hard-coded.
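
The "unordered set" point can be checked numerically. The sketch below uses a single-head self-attention layer with random weights (the dimensions and the helper names `softmax` and `self_attention` are assumptions for illustration) and shows that, without positional embeddings, shuffling the patches merely shuffles the outputs in the same way; once positional embeddings are added, that symmetry breaks and patch order starts to matter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over rows of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 196, 64                                # 196 patches, assumed width 64
x = rng.normal(size=(n, d))                   # stand-in patch embeddings
wq, wk, wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
perm = rng.permutation(n)                     # a random reordering of patches

# Without positional embeddings: permuting inputs permutes outputs.
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)
equivariant_without_pos = np.allclose(out_shuffled, out[perm])

# With positional embeddings tied to sequence slots, order matters.
pos = rng.normal(size=(n, d))
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_shuffled = self_attention(x[perm] + pos, wq, wk, wv)
equivariant_with_pos = np.allclose(out_pos_shuffled, out_pos[perm])

print(equivariant_without_pos, equivariant_with_pos)
```

In this setup the first flag should come out true and the second false, which is the sense in which position "comes from learned embeddings" rather than from the attention mechanism itself.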

Beyond Classification

ViT spawned a family of vision Transformers: DeiT (data-efficient training), Swin Transformer (hierarchical vision with shifted windows), MAE (masked autoencoder for self-supervised vision), and DINO/DINOv2 (self-supervised visual representations). These models are now standard choices across vision tasks: image classification, object detection, segmentation, and feature extraction. ViT-style encoders also serve as the image encoder in many multimodal models; LLaVA, for example, builds on a CLIP ViT encoder.
