# TTS

## Overview
A from-scratch, transformer-based text-to-speech model in the style of Tacotron 2. It takes phoneme sequences as input and autoregressively predicts mel spectrogram frames, which a vocoder then converts to audio. A WaveNet vocoder is planned but not yet implemented.
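The autoregressive prediction described above can be sketched as a simple decoding loop. This is illustrative only: `model_step`, `synthesize`, and `stop_threshold` are hypothetical names, and the dummy model stands in for the real transformer.

```python
# Sketch of autoregressive mel-frame decoding (hypothetical names, not the repo's API).

N_MELS = 80        # mel channels per frame (from the audio spec)
MAX_STEPS = 512    # decoder's maximum number of mel time steps

def model_step(phonemes, mel_frames):
    """Dummy decoder step: returns (next_frame, stop_probability).

    A real model would attend over `phonemes` and the frames generated so far.
    """
    frame = [0.0] * N_MELS
    stop_prob = 1.0 if len(mel_frames) >= 3 else 0.0  # toy stopping rule
    return frame, stop_prob

def synthesize(phonemes, stop_threshold=0.5):
    mel_frames = []
    for _ in range(MAX_STEPS):
        frame, stop_prob = model_step(phonemes, mel_frames)
        mel_frames.append(frame)
        if stop_prob > stop_threshold:  # stop-token check, as in Tacotron 2
            break
    return mel_frames

mel = synthesize([12, 7, 31])  # dummy phoneme IDs
```

The stop-probability check mirrors Tacotron 2's stop token, which lets the decoder end generation before hitting the hard 512-step limit.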
## Architecture
- Encoder: phoneme embeddings (512-dim) → transformer encoder
- Decoder: 8 layers, 4 heads, 256-dim model, 2048-dim feed-forward hidden, 80-token block size
- Audio: 16 kHz, 80-channel mel spectrogram, 50 ms window, 12.5 ms stride, up to 512 time steps
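The audio spec above pins down the STFT framing. A quick sketch of the derived values (variable names are assumptions, not the repo's):

```python
# Derive frame parameters from the audio spec (sketch; names are assumptions).

SAMPLE_RATE = 16_000   # Hz
WINDOW_MS = 50         # analysis window length
STRIDE_MS = 12.5       # hop between successive frames
MAX_FRAMES = 512       # decoder's mel time-step limit

win_length = int(SAMPLE_RATE * WINDOW_MS / 1000)   # 800 samples per window
hop_length = int(SAMPLE_RATE * STRIDE_MS / 1000)   # 200 samples per hop
max_audio_seconds = MAX_FRAMES * STRIDE_MS / 1000  # 6.4 s of audio at most
```

So the 512-step cap corresponds to clips of at most 6.4 seconds.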
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | GigaSpeech |
| Epochs | 10 (150 steps/epoch) |
| Batch size | 32 |
| Optimizer | AdamW, lr=6e-4, weight decay=0.01 |
| Gradient clipping | 1.0 |
| Val frequency | Every 50 steps |
## Paper
Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2), 2018