Moonshine

Audio/Speech PyTorch GigaSpeech

Overview

A from-scratch replication of Moonshine, a compact automatic speech recognition (ASR) model designed for live transcription and voice commands on edge hardware. The architecture prioritises efficiency over raw capacity. It is based on "Moonshine: Speech Recognition for Live Transcription and Voice Commands" (Jeffries et al., 2024).

Architecture

  • 288-dim embeddings, 6 attention heads, 6 decoder layers
  • Lightweight design targeting real-time on-device inference
  • Encoder processes audio features; decoder generates transcription
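
The dimensions listed above can be collected in a small configuration object. This is a minimal sketch rather than the repository's code: the class name `MoonshineConfig` and its field names are hypothetical, chosen for illustration. It shows that 288-dimensional embeddings split evenly across 6 heads, giving 48 dimensions per head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoonshineConfig:
    """Hypothetical config holder; names are illustrative, not from the repo."""

    embed_dim: int = 288          # embedding width from the list above
    num_heads: int = 6            # attention heads from the list above
    num_decoder_layers: int = 6   # decoder depth from the list above

    def __post_init__(self):
        # Multi-head attention splits the embedding evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.embed_dim // self.num_heads


config = MoonshineConfig()
print(config.head_dim)  # 48
```

The divisibility check is a design choice worth keeping in any variant: changing the head count without adjusting the embedding width fails immediately instead of surfacing later as a shape error.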

Training

Hyperparameter       Value
Dataset              GigaSpeech
Steps                1,500
Batch size           128
Optimizer            Adam, lr=6e-4
Val frequency        Every 50 steps
Total training time  ~25 hours

The model began overfitting at this scale. The README notes that 25 hours of training on the GigaSpeech XS subset was insufficient for generalisation at this parameter count.
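
The figures in the training table imply a few derived quantities that help put the run in context. The sketch below works them out under stated assumptions: validation fires whenever the 1-indexed step number is a multiple of 50, the batch size counts individual training examples, and the ~25 hours is total wall-clock time including validation. None of these details are confirmed by the source, and the function and constant names are hypothetical.

```python
# Values taken from the training table.
TOTAL_STEPS = 1_500
BATCH_SIZE = 128
VAL_EVERY = 50
TRAIN_HOURS = 25  # approximate wall-clock time


def validation_steps(total_steps: int, val_every: int) -> list[int]:
    """Steps at which validation runs, assuming 1-indexed steps (an assumption)."""
    return [s for s in range(1, total_steps + 1) if s % val_every == 0]


val_steps = validation_steps(TOTAL_STEPS, VAL_EVERY)
samples_seen = TOTAL_STEPS * BATCH_SIZE
seconds_per_step = TRAIN_HOURS * 3600 / TOTAL_STEPS

print(len(val_steps))     # 30 validation passes
print(samples_seen)       # 192000 examples processed, counting repeats
print(seconds_per_step)   # 60.0 seconds per step on average
```

Under these assumptions the run performs 30 validation passes and processes 192,000 examples at roughly one minute per step. The example count includes any examples revisited across epochs.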

Paper

Moonshine: Speech Recognition for Live Transcription and Voice Commands — Jeffries et al., 2024