# Whisper

## Overview
A from-scratch replication of OpenAI's Whisper, a sequence-to-sequence ASR model. The audio encoder processes mel spectrograms with 1D convolutions before feeding them into transformer layers; the decoder autoregressively generates transcription tokens while cross-attending to the encoder output. Based on *Robust Speech Recognition via Large-Scale Weak Supervision* (Radford et al., OpenAI, 2022).
## Architecture

**Audio Encoder:**
- 80-channel mel spectrogram (16 kHz, 25 ms window, 10 ms stride, up to 500 time steps)
- Two 1D Conv layers (kernel=3, stride=2) for downsampling
- Transformer encoder on top
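The encoder pipeline above can be sketched in PyTorch. The transformer width (384-dim, 6 heads, 6 layers, mirroring the decoder) and the learned positional embedding are assumptions for illustration; the actual implementation may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Sketch: mel spectrogram -> conv downsampling -> transformer encoder."""

    def __init__(self, n_mels=80, d_model=384, n_heads=6, n_layers=6, max_frames=500):
        super().__init__()
        # Two 1D convs (kernel=3, stride=2) downsample the time axis 4x.
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        # Learned positional embedding over the downsampled frame count (assumed).
        self.pos = nn.Parameter(0.02 * torch.randn(max_frames // 4, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mel):
        # mel: (batch, n_mels, time)
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        x = x.permute(0, 2, 1)               # -> (batch, frames, d_model)
        x = x + self.pos[: x.size(1)]
        return self.transformer(x)
```

With 500 input frames, the two stride-2 convolutions leave 125 frames for the transformer to attend over.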
**Decoder:**
- 6 layers, 6 heads, 384-dim embeddings
- Vocab size: 50,262
- Cross-attention to encoder outputs
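A matching decoder sketch with the dimensions listed above. The learned positional embedding, pre-norm blocks, and separate output projection are assumptions, not confirmed details of the repo:

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Sketch: autoregressive token decoder with cross-attention to encoder output."""

    def __init__(self, vocab_size=50262, d_model=384, n_heads=6, n_layers=6, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(0.02 * torch.randn(max_len, d_model))
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.transformer = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, enc_out):
        # tokens: (batch, seq); enc_out: (batch, frames, d_model)
        T = tokens.size(1)
        x = self.tok_emb(tokens) + self.pos_emb[:T]
        # Causal mask keeps generation autoregressive; cross-attention to
        # enc_out happens inside each TransformerDecoderLayer.
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        x = self.transformer(x, enc_out, tgt_mask=mask)
        return self.lm_head(x)
```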
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | GigaSpeech ("xs" snapshot, HuggingFace) |
| Epochs | 10 |
| Optimizer | Adam, lr=2e-4 |
| Batch size | 64 |
| Sequence length | 64 |
| Dropout | 0.1 |
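A single training step under these hyperparameters might look like the sketch below. It assumes a combined `model(mel, tokens)` interface returning per-token logits and a padding id of 0; both are hypothetical names, and the loss is standard teacher-forced cross-entropy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model, optimizer, mel, tokens, pad_id=0):
    """One teacher-forced step: predict tokens[:, 1:] from tokens[:, :-1]."""
    logits = model(mel, tokens[:, :-1])          # (batch, seq-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
        ignore_index=pad_id,                     # skip padding positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer from the table would be `torch.optim.Adam(model.parameters(), lr=2e-4)`, called on batches of 64 utterances with token sequences capped at length 64.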
## Paper

*Robust Speech Recognition via Large-Scale Weak Supervision* — Radford et al., OpenAI, 2022