# Mixtral
## Overview
A from-scratch replication of Mixtral’s Sparse Mixture-of-Experts (MoE) architecture. Each transformer layer contains multiple expert FFN sub-networks; a learned router selects the top-K experts per token, enabling a larger total parameter count without a proportional increase in compute per token. Based on *Mixtral of Experts* (Mistral AI, 2024).
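As a rough illustration of the routing described above, here is a minimal PyTorch sketch of a top-K routed MoE feed-forward block. Module names, dimensions, and the SwiGLU expert layout are illustrative assumptions, not the exact code in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN with a SwiGLU activation (layout is illustrative)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoEFFN(nn.Module):
    """Feed-forward block that routes each token to its top-K experts."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(d_model, d_ff) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))              # (n_tokens, d_model)
        logits = self.router(tokens)                    # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # per-token expert choices
        weights = F.softmax(weights, dim=-1)            # renormalize over selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():                       # only run experts that received tokens
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.view_as(x)
```

Each token’s output is a weighted sum of only `top_k` expert outputs, so the compute used per token stays roughly constant even as `n_experts` (and therefore the total parameter count) grows.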
## Architecture
- Sparse MoE feed-forward layers with top-K token routing
- Auxiliary load-balancing loss to prevent expert collapse (see the sketch after this list)
- Standard multi-head causal self-attention
- RMSNorm, SwiGLU activations
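A minimal sketch of the auxiliary load-balancing term, using one common Switch-Transformer-style formulation (how often each expert is chosen, times the probability mass the router assigns to it); the exact variant and weighting used in this repo are assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        expert_indices: torch.Tensor,
                        n_experts: int) -> torch.Tensor:
    """Balancing loss of the form n_experts * sum_e(f_e * p_e), where f_e is the
    average number of routing slots per token assigned to expert e and p_e is the
    mean router probability for expert e. It is smallest when routing is spread
    evenly across experts, discouraging expert collapse."""
    probs = F.softmax(router_logits, dim=-1)                        # (n_tokens, n_experts)
    dispatch = F.one_hot(expert_indices, n_experts).float().sum(1)  # (n_tokens, n_experts)
    tokens_per_expert = dispatch.mean(dim=0)                        # f_e
    prob_per_expert = probs.mean(dim=0)                             # p_e
    return n_experts * (tokens_per_expert * prob_per_expert).sum()
```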
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyShakespeare |
| Steps | 1,000 (validation every 50 steps) |
| Hardware | T4 GPU |
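A hypothetical training-loop skeleton showing how the language-modeling loss and the auxiliary balancing loss can be combined over the 1,000 steps listed above. `model`, `get_batch`, the optimizer, learning rate, and aux-loss coefficient are stand-in assumptions, not the repo’s actual values.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"    # e.g. the T4 GPU above
model = model.to(device)                                    # hypothetical Mixtral-style LM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # optimizer/lr are assumptions

for step in range(1, 1001):                    # 1,000 steps, per the table above
    xb, yb = get_batch("train")                # hypothetical helper returning token-id batches
    xb, yb = xb.to(device), yb.to(device)
    logits, aux_loss = model(xb)               # aux_loss = summed load-balancing terms
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    loss = ce + 0.01 * aux_loss                # aux-loss coefficient is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        pass  # evaluate validation loss here, per the schedule above
```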
## Results
| Split | Loss |
|---|---|
| Train | 2.04 |
| Validation | 2.09 |
## Paper
*Mixtral of Experts*, Mistral AI, 2024 (arXiv:2401.04088)