
CLIP

Contrastive Language-Image Pre-training
A model from OpenAI (2021) that learns to connect images and text by training on 400 million image-caption pairs. CLIP encodes images and text into the same embedding space, where matching image-text pairs sit close together and mismatched pairs sit far apart. It is the bridge between language and vision in most modern multimodal AI systems.

Why It Matters

CLIP is the backbone of text-to-image generation (Stable Diffusion, DALL-E), image search, zero-shot image classification, and multimodal understanding. When you type a prompt and get an image back, CLIP (or a descendant of it) is what connects your words to visual concepts. It proved that powerful visual representations can be learned from natural language supervision alone, with no labeled image dataset required.

Deep Dive

CLIP trains two encoders simultaneously: a text encoder (Transformer) and an image encoder (ViT or ResNet). During training, a batch of N image-caption pairs produces N text embeddings and N image embeddings. The training objective maximizes cosine similarity for the N correct pairs while minimizing it for the N²−N incorrect pairs. This contrastive objective teaches both encoders to produce aligned representations.
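To make the objective concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss over a batch of matched embeddings. The function name and the fixed `temperature` value are illustrative; the actual CLIP learns the logit scale as a trainable parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.T / temperature

    # The correct caption for image i is at column i (the diagonal);
    # the other N^2 - N entries are the incorrect pairs to push apart.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```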

Zero-Shot Classification

CLIP can classify images into categories it was never explicitly trained on. To classify an image as "cat" or "dog," encode the text "a photo of a cat" and "a photo of a dog," encode the image, and pick the text with higher cosine similarity to the image. This zero-shot capability was revolutionary: a single model could handle any classification task by changing the text labels, without any task-specific training data.
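As a sketch, the cat-vs-dog example above can be run against the pretrained CLIP checkpoint published on Hugging Face; `pet.jpg` is a placeholder for any local image file.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pet.jpg")  # placeholder: any local image
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarity between the image
# and each text prompt; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Changing the strings in `labels` retargets the same model to a completely different classification task, with no retraining.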

CLIP in Diffusion Models

In text-to-image models, CLIP's text encoder converts your prompt into embeddings that guide image generation via cross-attention. The quality of CLIP's text understanding directly affects how well the image matches your prompt. Newer models use stronger text encoders (T5, which understands compositional language better) alongside or instead of CLIP, improving prompt following for complex descriptions. But CLIP's image encoder remains widely used for image understanding tasks.
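As an illustration of the first step of that pipeline, the sketch below extracts the per-token prompt embeddings that a diffusion U-Net attends to via cross-attention. The checkpoint shown is the CLIP text encoder used by Stable Diffusion v1.x; the denoising loop itself is omitted.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1.x conditions generation on this CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a fox in the snow"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    # One embedding per token; the diffusion model cross-attends to this
    # sequence at every denoising step to steer the image toward the prompt.
    prompt_emb = text_encoder(**tokens).last_hidden_state

print(prompt_emb.shape)  # (1, 77, 768)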
