
Vision Transformer

ViT
A Transformer architecture applied to images: an image is split into fixed-size patches (e.g. 16×16 pixels), each patch is treated as a "token", and the sequence of patches is processed with standard Transformer attention. ViT (Dosovitskiy et al., 2020) showed that, when trained on enough data, Transformers can match or exceed CNNs on image tasks, unifying the architectures of language and vision.
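
As a rough back-of-the-envelope illustration, the token count per image follows from simple arithmetic. The numbers below assume the common ViT-Base/16 setup (224×224 RGB input, 16×16 patches), which is one configuration rather than part of the definition:

```python
# Patch/token counts for an assumed ViT-Base/16-style setup
# (224x224 RGB input, 16x16 patches).
image_size, patch_size, channels = 224, 16, 3

patches_per_side = image_size // patch_size        # 14
num_patches = patches_per_side ** 2                # 196 patch "tokens" per image
patch_dim = channels * patch_size * patch_size     # 768 values per flattened patch

print(num_patches, patch_dim)                      # 196 768
```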

Why it matters

ViT proved that the Transformer is a universal architecture, for images as well as text. This unification enabled the explosion of multimodal models: if both images and text are sequences of tokens processed by the same architecture, combining them becomes natural. ViT is the image encoder in CLIP, the backbone of DiT, and a foundation of modern computer vision.

Deep Dive

The process: (1) split a 224×224 image into 196 patches of 16×16 pixels, (2) flatten each patch into a vector and project it through a linear layer to create patch embeddings, (3) add positional embeddings so the model knows where each patch is, (4) prepend a [CLS] token whose final representation is used for classification, (5) process through standard Transformer encoder layers. The output is a sequence of patch representations that can be used for classification, detection, or as features for other models.
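
Below is a minimal sketch of these five steps in PyTorch. It assumes ViT-Base/16-style sizes and uses nn.TransformerEncoder as a stand-in for the encoder stack; it omits details of the original implementation such as pre-norm ordering, dropout, and initialization, so treat it as illustrative rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # 196
        # Steps (1)+(2): split into patches and project; a strided Conv2d
        # does both at once (equivalent to flatten + Linear per patch).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Step (3): learned positional embeddings, one per patch plus the [CLS] slot.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step (4): learnable [CLS] token prepended to the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Step (5): standard Transformer encoder layers.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                     # images: (B, 3, 224, 224)
        x = self.patch_embed(images)               # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                        # attention over all tokens
        return self.head(x[:, 0])                  # classify from the [CLS] output

# Smoke test with a shrunken depth just to keep it quick; output shape (2, 1000).
logits = TinyViT(depth=2)(torch.randn(2, 3, 224, 224))
```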

ViT vs. CNN

CNNs have built-in inductive biases: locality (nearby pixels are related) and translation equivariance (patterns are recognized regardless of position). ViT has neither — it treats patches as an unordered set (position comes from learned embeddings) and attends to all patches equally. This makes ViT less data-efficient than CNNs for small datasets but more powerful for large datasets, where it can learn these biases from data rather than having them hard-coded.
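
A small sketch of the "unordered set" point: plain self-attention is permutation-equivariant, so shuffling the patch tokens simply shuffles the output in the same way, which is why ViT has to inject position explicitly through learned embeddings. This is illustrative only, using nn.MultiheadAttention as the attention layer:

```python
import torch
import torch.nn as nn

# Self-attention has no built-in notion of patch position:
# permuting the input tokens permutes the output identically.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(1, 196, 64)          # 196 patch tokens, no positional info
perm = torch.randperm(196)

out, _ = attn(tokens, tokens, tokens)
out_perm, _ = attn(tokens[:, perm], tokens[:, perm], tokens[:, perm])

print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```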

Beyond Classification

ViT spawned a family of vision Transformers: DeiT (data-efficient training), Swin Transformer (hierarchical vision with shifted windows), MAE (masked autoencoder for self-supervised vision), and DINO/DINOv2 (self-supervised visual representations). These models now dominate vision tasks: image classification, object detection, segmentation, and feature extraction. The ViT architecture is also the image encoder in most multimodal models (LLaVA, GPT-4V).
