SigLIP
Overview
From-scratch replication of SigLIP (Sigmoid Loss for Language Image Pre-Training). SigLIP replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, treating the problem as an independent binary classification for each image-text pair. This removes the need to normalise similarities over the full global batch, which improves scalability and robustness to batch-size changes. Based on Sigmoid Loss for Language Image Pre-Training (Zhai et al., 2023).
Architecture
- Image encoder + text encoder (same backbone structure as CLIP)
- Loss: Pairwise sigmoid cross-entropy — no global softmax over the batch
- Each (image, text) pair independently classified as matching or not
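The loss described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's actual implementation: the embeddings are assumed to be L2-normalised, and in the paper the temperature `t` and bias `b` are learnable scalars, here passed in as plain floats for simplicity.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss over all image-text pairs in the batch.

    img_emb, txt_emb: (N, D) L2-normalised embeddings.
    t, b: temperature and bias (learnable scalars in the paper).
    """
    # Pairwise similarity logits for every image-text combination.
    logits = img_emb @ txt_emb.t() * t + b  # (N, N)
    n = logits.size(0)
    # Labels: +1 on the diagonal (matching pairs), -1 elsewhere.
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # Each pair is an independent binary classification; no softmax
    # over the batch, so no global normalisation is required.
    return -F.logsigmoid(labels * logits).sum() / n
```

Because every pair is scored independently, the loss decomposes over the N×N similarity matrix, which is what allows SigLIP to shard the computation across devices without an all-gather for a global softmax.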
Training
- Dataset: Flickr8k
- Framework: PyTorch
Paper
Sigmoid Loss for Language Image Pre-Training — Zhai et al., Google, 2023