CLIP

Vision-Language · PyTorch · Flickr8K
GitHub →

Overview

From-scratch PyTorch replication of CLIP (Contrastive Language-Image Pre-training). CLIP jointly trains an image encoder and a text encoder to maximise cosine similarity between matched image-text pairs and minimise it for unmatched pairs — enabling zero-shot image classification by comparing image embeddings to text prompt embeddings. Based on Learning Transferable Visual Models From Natural Language Supervision (Radford et al., OpenAI 2021).
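The zero-shot mechanism is simple once both encoders map into the same space: embed the image, embed one text prompt per candidate class, and pick the most similar prompt. A minimal sketch in PyTorch, where `image_encoder`, `text_encoder`, and `tokenize` are stand-ins for this repo's components rather than its actual API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_prompts, image_encoder, text_encoder, tokenize):
    """Score one image against K text prompts, e.g. "a photo of a dog".

    image_encoder / text_encoder / tokenize are hypothetical stand-ins for
    the trained components; both encoders map into the same d-dim space.
    """
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)      # (1, d)
    txt_emb = F.normalize(text_encoder(tokenize(class_prompts)), dim=-1)  # (K, d)
    sims = img_emb @ txt_emb.t()             # cosine similarities, shape (1, K)
    return sims.softmax(dim=-1).squeeze(0)   # probability per class prompt
```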

Architecture

  • Image encoder: Vision Transformer (ViT) or ResNet backbone
  • Text encoder: Transformer encoder
  • Loss: Symmetric cross-entropy over the cosine similarity matrix (NT-Xent / InfoNCE; sketched after this list)
  • Output: Joint embedding space
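A minimal sketch of that symmetric InfoNCE objective, assuming (N, d) embedding batches where row i of each matrix comes from the same image-caption pair; the function name and temperature value are illustrative, not this repo's exact code:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the (N, N) cosine-similarity matrix.

    image_emb, text_emb: (N, d) embeddings of N matched image-text pairs.
    Diagonal entries are positives; every off-diagonal entry is a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)              # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)          # text -> image
    return (loss_i2t + loss_t2i) / 2
```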

Training

  • Dataset: Flickr8K (image-caption pairs)
  • Epochs: 30
  • Hardware: NVIDIA T4 GPU
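For orientation, a minimal epoch loop reusing the `clip_loss` sketch above; `model.encode_image` / `model.encode_text` and the loader contract are hypothetical names, not the repo's API:

```python
def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One pass over Flickr8K; `loader` is assumed to yield
    (images, token_ids) batches of matched image-caption pairs."""
    model.train()
    for images, token_ids in loader:
        images, token_ids = images.to(device), token_ids.to(device)
        loss = clip_loss(model.encode_image(images),   # hypothetical method names
                         model.encode_text(token_ids))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```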

Results

  • Train loss: 1.3
  • Validation loss: 2.2

Paper

Learning Transferable Visual Models From Natural Language Supervision — Radford et al., OpenAI 2021