SigLIP

Vision-Language · PyTorch · Flickr8K

Overview

From-scratch replication of SigLIP (Sigmoid Loss for Language Image Pre-Training). SigLIP replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, treating each image-text pair as an independent binary classification problem. Because no softmax is computed over the batch, the loss requires no global normalisation across all pairwise similarities, which decouples it from batch composition, simplifies distributed training, and scales better with batch size. Based on Sigmoid Loss for Language Image Pre-Training (Zhai et al., 2023).
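
Concretely, for a mini-batch $\mathcal{B}$ of normalised image embeddings $\mathbf{x}_i$ and text embeddings $\mathbf{y}_j$, the loss as given in the paper is

$$
\mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \sigma\!\left(z_{ij}\,(t\,\mathbf{x}_i \cdot \mathbf{y}_j + b)\right),
$$

where $z_{ij} = 1$ if $i = j$ (a true pair) and $-1$ otherwise, $\sigma$ is the sigmoid, and $t$ (temperature) and $b$ (bias) are learnable scalars.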

Architecture

  • Image encoder + text encoder (same backbone structure as CLIP)
  • Loss: pairwise sigmoid cross-entropy, with no global softmax over the batch (see the sketch after this list)
  • Each (image, text) pair independently classified as matching or not
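
A minimal PyTorch sketch of the pairwise sigmoid loss. Function and argument names are illustrative, not the repo's API; the log-temperature parameterisation $t = e^{t'}$ follows the paper.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t_prime, b):
    """Pairwise sigmoid loss over a batch of matched (image, text) pairs.

    img_emb, txt_emb: L2-normalised embeddings of shape (n, d), where
    row i of each tensor comes from the same pair.
    t_prime, b: learnable log-temperature and bias scalars.
    """
    n = img_emb.size(0)
    # All n^2 pairwise similarities, scaled by the temperature and shifted.
    logits = (img_emb @ txt_emb.t()) * t_prime.exp() + b
    # z_ij: +1 on the diagonal (true pairs), -1 everywhere else.
    labels = 2 * torch.eye(n, device=img_emb.device) - 1
    # -log sigmoid(z_ij * logit_ij), summed over all pairs, averaged over |B|.
    return -F.logsigmoid(labels * logits).sum() / n
```

Each term is an independent binary cross-entropy, so the loss for one pair never depends on similarities computed elsewhere in the batch, which is exactly what removes the global normalisation step of CLIP's softmax.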

Training

  • Dataset: Flickr8K
  • Framework: PyTorch
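
A self-contained smoke test of `siglip_loss` from the sketch above. Random embeddings stand in for real encoder outputs on Flickr8K batches; the initialisations $t' = \log 10$ and $b = -10$ follow the paper.

```python
import torch
import torch.nn.functional as F

# Random unit-norm embeddings standing in for encoder outputs.
torch.manual_seed(0)
img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)

t_prime = torch.nn.Parameter(torch.log(torch.tensor(10.0)))  # t = exp(t') = 10
bias = torch.nn.Parameter(torch.tensor(-10.0))               # paper's init

print(siglip_loss(img, txt, t_prime, bias))
```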

Paper

Sigmoid Loss for Language Image Pre-Training, Zhai et al. (Google), ICCV 2023, arXiv:2303.15343