# Gemma3

## Overview
A from-scratch, ~90M-parameter replication of Gemma 3 trained on TinyStories. The key change from Gemma 2 is the much heavier use of local sliding-window attention, where a layer attends only within a fixed-size window rather than over the full context, which reduces KV-cache memory for long sequences without sacrificing context quality.
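To make the mechanism concrete, here is a minimal sketch of the banded causal mask that sliding-window attention applies; the function name and the use of PyTorch are illustrative, not taken from this repo's code.

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    """Boolean attention mask: query i may attend to key j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    causal = j <= i                         # standard causal constraint: no future keys
    local = (i - j) < window                # drop keys older than the window
    return causal & local                   # (seq_len, seq_len), True = may attend

# With window=4, query position 10 attends only to positions 7..10, so the
# KV cache never needs to hold more than `window` keys per head.
mask = sliding_window_mask(seq_len=16, window=4)
```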
## Architecture
- Parameters: ~90M
- Layers: 6 decoder layers
- Attention: 8 query heads, 2 KV heads (GQA), 128-token sliding window
- Embedding dim: 512, vocab size 32,768
- Regularization: dropout 0.1
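Collected into a config object, the sketch below restates these numbers; the class and field names are illustrative rather than the repo's actual identifiers.

```python
from dataclasses import dataclass

@dataclass
class Gemma3Config:
    n_layers: int = 6       # decoder layers
    n_heads: int = 8        # query heads
    n_kv_heads: int = 2     # shared K/V heads (GQA: 8 / 2 = 4 query heads per KV head)
    window: int = 128       # sliding-window attention span, in tokens
    d_model: int = 512      # embedding dimension
    vocab_size: int = 32_768
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads  # 512 // 8 = 64
```

With 2 KV heads instead of 8, the KV cache shrinks by 4x, which compounds with the 128-token window to keep attention memory small.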
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyStories |
| Steps | 25,000 (validation every 500 steps) |
| Optimizer | Adam, lr=2.5e-4, weight decay=0.1 |
| Sequence length | 256 |
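A hedged sketch of the training loop these settings imply; the model and the data iterator are placeholders, and only the hyperparameters come from the table above.

```python
import torch
import torch.nn.functional as F

def train(model: torch.nn.Module, batches, steps: int = 25_000, val_every: int = 500):
    """Train on TinyStories; `batches` must yield (input, target) token-id tensors of shape (B, 256)."""
    opt = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=0.1)
    for step in range(1, steps + 1):
        x, y = next(batches)                      # placeholder iterator over the dataset
        logits = model(x)                         # (B, 256, 32_768)
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % val_every == 0:
            pass  # run validation here (every 500 steps per the table)
```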
## Results
| Split | Loss |
|---|---|
| Train | 2.08 |
| Validation | 1.77 |
## Paper
*Gemma 3 Technical Report*, Gemma Team, Google DeepMind, 2025.