# CLAP

## Overview
From-scratch replication of CLAP (Contrastive Language-Audio Pretraining) — the audio equivalent of CLIP. A text encoder and an audio encoder are jointly trained with contrastive loss to align audio clips with natural language descriptions in a shared embedding space. Based on Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (Wu et al., 2023).
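The joint training objective described above can be sketched as a symmetric InfoNCE loss over a batch of paired (audio, text) embeddings. This is a hypothetical illustration in PyTorch; the function and variable names are mine, not from this repo:

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, logit_scale):
    """Symmetric InfoNCE: matched (audio_i, text_i) pairs are positives,
    all other pairings in the batch are negatives."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = scaled cosine similarity of audio i vs text j
    logits = logit_scale * audio_emb @ text_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (audio->text and text->audio), averaged
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

With a large logit scale and perfectly aligned pairs the loss approaches zero; with near-identical embeddings it sits at ln(batch size), which is relevant to the plateau noted under Training.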
## Architecture
- Text embeddings: 768-dim
- Audio embeddings: 2048-dim
- Output space: 1024-dim shared embedding
- Audio features: 44.1kHz, 64 mel bins, 1024 FFT window, 320 hop length, 50–8000 Hz
- Learning rates (per component): projection head 1e-3, audio encoder 1e-4, text encoder 1e-5
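The embedding dimensions and per-component learning rates listed above can be wired up roughly as follows. This is a sketch: the encoder modules are placeholders for the real backbones, and only the dims and learning rates come from this README:

```python
import torch
import torch.nn as nn

# Projection heads mapping each encoder's output into the shared 1024-dim space
text_proj = nn.Linear(768, 1024)     # text embeddings: 768 -> 1024
audio_proj = nn.Linear(2048, 1024)   # audio embeddings: 2048 -> 1024

# Placeholder stand-ins for the actual text/audio encoder backbones
text_encoder = nn.Linear(768, 768)
audio_encoder = nn.Linear(2048, 2048)

# One optimizer with per-component parameter groups, matching the LRs above
optimizer = torch.optim.AdamW([
    {"params": list(text_proj.parameters()) + list(audio_proj.parameters()),
     "lr": 1e-3},                                    # projection heads
    {"params": audio_encoder.parameters(), "lr": 1e-4},
    {"params": text_encoder.parameters(), "lr": 1e-5},
])
```

Using per-parameter groups in a single optimizer keeps one scheduler and one `step()` call while still training the randomly initialized heads faster than the encoders.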
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | GigaSpeech (`xs` snapshot) |
| Epochs | 30 |
| Batch size | 32 |
| LR | 4e-4 |
Known issue: the loss plateaued at 2.079 ≈ ln(8) = −log(1/8), which is the value of the contrastive cross-entropy when the softmax spreads probability uniformly over 8 candidates, i.e., the model was matching pairs at chance level. The suspected cause is a logit scale too small for the softmax to concentrate probability on the correct pairs. Documented as an unresolved training instability.
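The plateau value is easy to reproduce: with a near-zero logit scale, the similarity logits are all close to each other, the softmax over N candidates is near-uniform, and the cross-entropy sits at ln(N) ≈ 2.079 for N = 8. The learnable temperature shown below follows CLIP's initialization, log(1/0.07), with a clamp; whether this repo used that scheme is an assumption:

```python
import math
import torch
import torch.nn.functional as F

N = 8
# Tiny similarities simulate an under-scaled logit matrix: softmax ~ uniform
sims = torch.randn(N, N) * 0.01
targets = torch.arange(N)
loss = F.cross_entropy(sims, targets)   # close to math.log(8), i.e. ~2.079

# CLIP-style fix (assumed, not confirmed for this repo): make the scale a
# learnable parameter initialized at log(1/0.07) and clamp it after exp()
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))
scale = logit_scale.exp().clamp(max=100.0)  # ~14.29 at init
```

A learnable, well-initialized scale lets the softmax sharpen early in training, which is the standard remedy for exactly this kind of chance-level plateau.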
## Paper
Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation — Wu et al., 2023