# Llama4

## Overview
A from-scratch replication of the Llama 4 Mixture-of-Experts (MoE) architecture with 1.2B total parameters. The model is trained to convergence over 20,000 iterations on a single Kaggle P100, using Liger kernels for memory efficiency.
## Architecture
- Experts: 32 experts (12M parameters each), top-1 routing, plus 1 shared expert (see the sketch after this list)
- Load balancing: auxiliary-loss-free
- Context window: 1,024 tokens
- Config: 768-dim embeddings, 8 attention heads, 8 decoder layers
- Kernels: Liger kernels for fused ops
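A minimal PyTorch sketch of the MoE block described above: top-1 routing over 32 experts plus one always-active shared expert, with auxiliary-loss-free balancing implemented as a per-expert routing bias. The class and method names, the expert hidden size (2048), and the bias-update rule and its step size are assumptions for illustration, not the repo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """SwiGLU feed-forward expert; the hidden size is an assumed value."""

    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class MoELayer(nn.Module):
    """Top-1 routed experts plus one always-active shared expert."""

    def __init__(self, dim: int = 768, n_experts: int = 32):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        # Per-expert bias used only for expert selection (auxiliary-loss-free
        # balancing); it is updated outside the gradient path after each step.
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.experts = nn.ModuleList(Expert(dim) for _ in range(n_experts))
        self.shared_expert = Expert(dim)  # sees every token

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                           # (tokens, n_experts)
        # Select with the biased scores, but weight the output with the
        # unbiased probabilities so the bias never affects gradients.
        top1 = (logits + self.route_bias).argmax(dim=-1)  # (tokens,)
        gate = F.softmax(logits, dim=-1).gather(1, top1.unsqueeze(1))
        routed = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                routed[mask] = gate[mask] * expert(x[mask])
        return self.shared_expert(x) + routed

    @torch.no_grad()
    def update_route_bias(self, top1: torch.Tensor, step: float = 1e-3) -> None:
        # Auxiliary-loss-free balancing: after each training step, nudge the
        # bias up for under-loaded experts and down for over-loaded ones,
        # given the routing decisions (top1) from the forward pass.
        load = torch.bincount(top1, minlength=len(self.experts)).float()
        self.route_bias += step * torch.sign(load.mean() - load)
```

In the full model, a block like this would replace the dense feed-forward in each of the 8 decoder layers.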
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyStories (~4.2B tokens, 1 epoch) |
| Iterations | 20,000 |
| Optimizer | AdamW, lr=6e-4 |
| Batch size | 16 |
| Gradient clipping | 1.0 |
| Hardware | Kaggle P100 |
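A hedged sketch of the optimization step implied by the table. The model, batch, and loss below are placeholders; only the optimizer choice, learning rate, batch size, context length, gradient-clipping value, and iteration count come from the table above.

```python
import torch
import torch.nn as nn

# Placeholder model and data; the real run trains the full 1.2B-parameter
# MoE decoder on TinyStories with a next-token cross-entropy loss.
model = nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

for step in range(20_000):                    # 20,000 iterations
    x = torch.randn(16, 1024, 768)            # batch size 16, context 1,024 tokens
    loss = model(x).pow(2).mean()             # stand-in loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Gradient clipping at 1.0, as listed in the table
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```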
## Results
| Split | Loss |
|---|---|
| Train | 2.08 |
| Validation | 1.70 |
## Paper
Llama 4 — Meta AI, 2025