Mixtral

Language Models · PyTorch · TinyShakespeare
GitHub →

Overview

From-scratch PyTorch replication of Mixtral’s Sparse Mixture-of-Experts (MoE) architecture. Each transformer layer contains multiple expert FFN sub-networks; a learned router selects the top-K experts per token, so the total parameter count grows without a proportional increase in per-token compute. Based on Mixtral of Experts (Mistral AI, 2024).
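The routing step can be sketched roughly as follows. This is a minimal illustration rather than the repository's exact code; the module name MoELayer, the SiLU expert MLPs, and the default of 8 experts with top-2 routing are assumptions for the example.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        """Sparse MoE feed-forward layer with learned top-k token routing (illustrative)."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts, bias=False)  # learned gate
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):
            B, T, D = x.shape
            tokens = x.view(-1, D)                              # flatten tokens for routing
            logits = self.router(tokens)                        # (B*T, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)      # pick top-k experts per token
            weights = F.softmax(weights, dim=-1)                # renormalise over selected experts
            out = torch.zeros_like(tokens)
            for e, expert in enumerate(self.experts):
                mask = idx == e                                 # tokens routed to expert e
                if mask.any():
                    rows, slots = mask.nonzero(as_tuple=True)
                    out[rows] += weights[rows, slots].unsqueeze(-1) * expert(tokens[rows])
            return out.view(B, T, D)

Only the selected experts run on each token, which is what keeps per-token compute roughly constant as experts are added.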

Architecture

  • Sparse MoE feed-forward layers with top-K token routing
  • Auxiliary load-balancing loss to prevent expert collapse (see the loss sketch after this list)
  • Standard multi-head causal self-attention
  • RMSNorm, SwiGLU activations
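The auxiliary loss referenced above can be sketched in the Switch-Transformer/Mixtral style: for each MoE layer it multiplies, per expert, the fraction of tokens routed to that expert by the mean router probability assigned to it, and sums over experts. The function below is an assumed illustration, not necessarily the repository's implementation.

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(router_logits: torch.Tensor, top_k: int, n_experts: int) -> torch.Tensor:
        """Penalise uneven expert utilisation for one MoE layer.

        router_logits: (num_tokens, n_experts) raw gate scores.
        """
        probs = F.softmax(router_logits, dim=-1)                      # (tokens, experts)
        _, top_idx = probs.topk(top_k, dim=-1)                        # experts each token is sent to
        assigned = F.one_hot(top_idx, n_experts).float().sum(dim=1)   # 0/1 assignment per token/expert
        fraction_assigned = assigned.mean(dim=0)                      # f_i: share of tokens per expert
        mean_prob = probs.mean(dim=0)                                 # P_i: mean router probability
        return n_experts * torch.sum(fraction_assigned * mean_prob)

This term is added to the language-modelling loss with a small coefficient so the router keeps spreading tokens across experts instead of collapsing onto a few.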

Training

Hyperparameter   Value
Dataset          TinyShakespeare
Steps            1,000 (validation every 50 steps)
Hardware         T4 GPU
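A training-loop skeleton consistent with the table might look like the following; the step count and evaluation interval come from the table, while the model interface and the get_batch / estimate_loss helpers are hypothetical placeholders.

    import torch

    MAX_STEPS = 1_000      # from the table above
    EVAL_INTERVAL = 50     # validation every 50 steps

    def train(model, optimizer, get_batch, estimate_loss, device="cuda"):
        model.to(device)
        for step in range(1, MAX_STEPS + 1):
            xb, yb = get_batch("train")                             # hypothetical data helper
            _, loss = model(xb.to(device), targets=yb.to(device))   # assumed (logits, loss) return
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()
            if step % EVAL_INTERVAL == 0:
                losses = estimate_loss(model)                       # hypothetical eval helper
                print(f"step {step}: train {losses['train']:.2f}, val {losses['val']:.2f}")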

Results

Split        Loss
Train        2.04
Validation   2.09

Paper

Mixtral of Experts — Mistral AI, 2024