# CLAP

## Overview
From-scratch replication of CLAP (Contrastive Language-Audio Pretraining) — the audio equivalent of CLIP. A text encoder and an audio encoder are jointly trained with contrastive loss to align audio clips with natural language descriptions in a shared embedding space. Based on Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (Wu et al., 2023).
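The joint training objective described above can be sketched as a symmetric InfoNCE loss over a batch of paired (audio, text) embeddings. This is a hypothetical illustration in PyTorch; the function and variable names are mine, not from this repo:

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, logit_scale):
    """Symmetric InfoNCE: matched (audio_i, text_i) pairs are positives,
    all other pairings in the batch are negatives."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = scaled cosine similarity of audio i vs text j
    logits = logit_scale * audio_emb @ text_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (audio->text and text->audio), averaged
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

With a large logit scale and perfectly aligned pairs the loss approaches zero; with near-identical embeddings it sits at ln(batch size), which is relevant to the plateau noted under Training.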
## Architecture
- Text embeddings: 768-dim
- Audio embeddings: 2048-dim
- Output space: 1024-dim shared embedding
- Audio features: 44.1kHz, 64 mel bins, 1024 FFT window, 320 hop length, 50–8000 Hz
- Learning rates (per component): projection head 1e-3, audio encoder 1e-4, text encoder 1e-5
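The embedding dimensions and per-component learning rates listed above can be wired up roughly as follows. This is a sketch: the encoder modules are placeholders for the real backbones, and only the dims and learning rates come from this README:

```python
import torch
import torch.nn as nn

# Projection heads mapping each encoder's output into the shared 1024-dim space
text_proj = nn.Linear(768, 1024)     # text embeddings: 768 -> 1024
audio_proj = nn.Linear(2048, 1024)   # audio embeddings: 2048 -> 1024

# Placeholder stand-ins for the actual text/audio encoder backbones
text_encoder = nn.Linear(768, 768)
audio_encoder = nn.Linear(2048, 2048)

# One optimizer with per-component parameter groups, matching the LRs above
optimizer = torch.optim.AdamW([
    {"params": list(text_proj.parameters()) + list(audio_proj.parameters()),
     "lr": 1e-3},                                    # projection heads
    {"params": audio_encoder.parameters(), "lr": 1e-4},
    {"params": text_encoder.parameters(), "lr": 1e-5},
])
```

Using per-parameter groups in a single optimizer keeps one scheduler and one `step()` call while still training the randomly initialized heads faster than the encoders.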
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | GigaSpeech (`xs` snapshot) |
| Epochs | 30 |
| Batch size | 32 |
| LR | 4e-4 |
Known issue: the loss plateaued at 2.079 ≈ ln(8) = −log(1/8), which is the value of the contrastive cross-entropy when the softmax spreads probability uniformly over 8 candidates, i.e., the model was matching pairs at chance level. The suspected cause is a logit scale too small for the softmax to concentrate probability on the correct pairs. Documented as an unresolved training instability.
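The plateau value is easy to reproduce: with a near-zero logit scale, the similarity logits are all close to each other, the softmax over N candidates is near-uniform, and the cross-entropy sits at ln(N) ≈ 2.079 for N = 8. The learnable temperature shown below follows CLIP's initialization, log(1/0.07), with a clamp; whether this repo used that scheme is an assumption:

```python
import math
import torch
import torch.nn.functional as F

N = 8
# Tiny similarities simulate an under-scaled logit matrix: softmax ~ uniform
sims = torch.randn(N, N) * 0.01
targets = torch.arange(N)
loss = F.cross_entropy(sims, targets)   # close to math.log(8), i.e. ~2.079

# CLIP-style fix (assumed, not confirmed for this repo): make the scale a
# learnable parameter initialized at log(1/0.07) and clamp it after exp()
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))
scale = logit_scale.exp().clamp(max=100.0)  # ~14.29 at init
```

A learnable, well-initialized scale lets the softmax sharpen early in training, which is the standard remedy for exactly this kind of chance-level plateau.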
## Paper
Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation — Wu et al., 2023