# Llama4

## Overview
A from-scratch replication of the Llama 4 Mixture-of-Experts (MoE) architecture with 1.2B total parameters. The model is trained to convergence over 20,000 iterations on a single Kaggle P100, using Liger kernels for memory efficiency.
## Architecture
- Experts: 32 experts (12M parameters each), top-1 routing, plus 1 shared expert (see the sketch after this list)
- Load balancing: auxiliary-loss-free
- Context window: 1,024 tokens
- Config: 768-dim embeddings, 8 attention heads, 8 decoder layers
- Kernels: Liger kernels for fused ops
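A minimal PyTorch sketch of the MoE block described above: top-1 routing over 32 experts plus one always-active shared expert, with auxiliary-loss-free balancing implemented as a per-expert routing bias. The class and method names, the expert hidden size (2048), and the bias-update rule and its step size are assumptions for illustration, not the repo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """SwiGLU feed-forward expert; the hidden size is an assumed value."""

    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class MoELayer(nn.Module):
    """Top-1 routed experts plus one always-active shared expert."""

    def __init__(self, dim: int = 768, n_experts: int = 32):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        # Per-expert bias used only for expert selection (auxiliary-loss-free
        # balancing); it is updated outside the gradient path after each step.
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.experts = nn.ModuleList(Expert(dim) for _ in range(n_experts))
        self.shared_expert = Expert(dim)  # sees every token

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                           # (tokens, n_experts)
        # Select with the biased scores, but weight the output with the
        # unbiased probabilities so the bias never affects gradients.
        top1 = (logits + self.route_bias).argmax(dim=-1)  # (tokens,)
        gate = F.softmax(logits, dim=-1).gather(1, top1.unsqueeze(1))
        routed = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                routed[mask] = gate[mask] * expert(x[mask])
        return self.shared_expert(x) + routed

    @torch.no_grad()
    def update_route_bias(self, top1: torch.Tensor, step: float = 1e-3) -> None:
        # Auxiliary-loss-free balancing: after each training step, nudge the
        # bias up for under-loaded experts and down for over-loaded ones,
        # given the routing decisions (top1) from the forward pass.
        load = torch.bincount(top1, minlength=len(self.experts)).float()
        self.route_bias += step * torch.sign(load.mean() - load)
```

In the full model, a block like this would replace the dense feed-forward in each of the 8 decoder layers.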
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyStories (~4.2B tokens, 1 epoch) |
| Iterations | 20,000 |
| Optimizer | AdamW, lr=6e-4 |
| Batch size | 16 |
| Gradient clipping | 1.0 |
| Hardware | Kaggle P100 |
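A hedged sketch of the optimization step implied by the table. The model, batch, and loss below are placeholders; only the optimizer choice, learning rate, batch size, context length, gradient-clipping value, and iteration count come from the table above.

```python
import torch
import torch.nn as nn

# Placeholder model and data; the real run trains the full 1.2B-parameter
# MoE decoder on TinyStories with a next-token cross-entropy loss.
model = nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

for step in range(20_000):                    # 20,000 iterations
    x = torch.randn(16, 1024, 768)            # batch size 16, context 1,024 tokens
    loss = model(x).pow(2).mean()             # stand-in loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Gradient clipping at 1.0, as listed in the table
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```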
## Results
| Split | Loss |
|---|---|
| Train | 2.08 |
| Validation | 1.70 |
## Paper
Llama 4 — Meta AI, 2025