
CLIP

Contrastive Language-Image Pre-training
An OpenAI model (2021) that learned to connect images and text by training on 400 million image-caption pairs. CLIP encodes images and text into the same embedding space, where matching image-text pairs end up close together and non-matching pairs end up far apart. It is the bridge between language and vision in most modern multimodal AI systems.

Why It Matters

CLIP is the backbone of text-to-image generation (Stable Diffusion, DALL-E), image search, zero-shot image classification, and multimodal understanding. When you type a prompt and get an image back, CLIP (or one of its descendants) is what connects your words to visual concepts. It proved that powerful visual representations can be learned from natural language supervision alone, without a labeled image dataset.

Deep Dive

CLIP trains two encoders simultaneously: a text encoder (Transformer) and an image encoder (ViT or ResNet). During training, a batch of N image-caption pairs produces N text embeddings and N image embeddings. The training objective maximizes cosine similarity for the N correct pairs while minimizing it for the N²−N incorrect pairs. This contrastive objective teaches both encoders to produce aligned representations.
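The contrastive objective is compact enough to sketch directly. Below is a minimal, illustrative PyTorch version of the symmetric loss; note that CLIP itself learns the temperature as a trainable parameter, so the fixed value here is an assumption for brevity.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N aligned image-caption pairs.

    image_emb, text_emb: [N, D] outputs of the image and text encoders.
    Row i of each tensor comes from the same image-caption pair, so the
    diagonal of the similarity matrix holds the N correct pairs and the
    off-diagonal entries hold the N² − N incorrect ones.
    """
    # L2-normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Maximizing the diagonal entries while minimizing the off-diagonal ones is exactly what pulls matching pairs together and pushes non-matching pairs apart in the shared embedding space.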

Zero-Shot Classification

CLIP can classify images into categories it was never explicitly trained on. To classify an image as "cat" or "dog," encode the text "a photo of a cat" and "a photo of a dog," encode the image, and pick the text with higher cosine similarity to the image. This zero-shot capability was revolutionary: a single model could handle any classification task by changing the text labels, without any task-specific training data.
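As a concrete illustration, here is a sketch of that recipe using the Hugging Face transformers wrappers around the released CLIP weights; the image path and label prompts are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarities; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Changing the task is just a matter of changing the strings in labels; no retraining or task-specific data is involved.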

CLIP in Diffusion Models

In text-to-image models, CLIP's text encoder converts your prompt into embeddings that guide image generation via cross-attention. The quality of CLIP's text understanding directly affects how well the image matches your prompt. Newer models use stronger text encoders (T5, which understands compositional language better) alongside or instead of CLIP, improving prompt following for complex descriptions. But CLIP's image encoder remains widely used for image understanding tasks.
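As a rough sketch of how a Stable Diffusion style pipeline uses that text encoder, the snippet below extracts the per-token prompt embeddings that a diffusion model would attend to via cross-attention. The model names follow the public Stable Diffusion v1.x setup and the prompt is a placeholder.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# CLIP text encoder used by Stable Diffusion v1.x checkpoints.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a fox in the snow"  # placeholder prompt
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # CLIP's 77-token context
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Shape [1, 77, 768]: one embedding per token. The diffusion model's
    # cross-attention layers read these at every denoising step, which is
    # how the prompt steers the generated image.
    prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state
```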
