Gemma3

Language Models · PyTorch · TinyStories

Overview

From-scratch ~90M-parameter replication of Gemma 3 trained on TinyStories. The key change from Gemma 2 is the much heavier use of local sliding-window attention with a fixed window size, which caps each local layer's KV cache at the window length and so reduces memory for long sequences without sacrificing context quality.
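
To illustrate why the window bounds memory, here is a minimal PyTorch sketch of a sliding-window causal mask. The `sliding_window_mask` helper is a hypothetical illustration, not code from this repo; it only encodes the masking rule that query position i may attend to key positions j with i − W < j ≤ i:

```python
# Minimal sketch of a sliding-window causal mask (hypothetical helper,
# not taken from the repo). With window W, query i attends to keys j
# satisfying i - W < j <= i, so the KV cache never needs more than W entries.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True = attention allowed."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
    return (j <= i) & (j > i - window)      # causal AND within the window

print(sliding_window_mask(seq_len=6, window=3).int())
# Each row has at most 3 True entries: the current token and its 2 predecessors.
```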

Architecture

  • Parameters: ~90M
  • Layers: 6 decoder layers
  • Attention: 8 query heads, 2 KV heads (grouped-query attention), 128-token sliding window (see the sketch after this list)
  • Embedding dim: 512, vocab size 32,768
  • Regularisation: Dropout 0.1
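
A minimal sketch of a grouped-query attention block matching the configuration above (8 query heads, 2 KV heads, embedding dim 512). Class and argument names are illustrative assumptions rather than the repo's actual code, and the sliding-window mask is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Illustrative GQA block: 8 query heads share 2 KV heads."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves n_heads // n_kv_heads = 4 query heads, so the
        # KV cache is 4x smaller than with full multi-head attention.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

In the full model this block would combine the causal mask with the 128-token sliding-window mask shown earlier, rather than the plain causal mask used here.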

Training

Hyperparameter    Value
---------------   -------------------------------------
Dataset           TinyStories
Steps             25,000 (validation every 500 steps)
Optimizer         Adam, lr = 2.5e-4, weight decay = 0.1
Sequence length   256 tokens
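
A minimal sketch of the training loop these settings describe. The `model` and `train_batches` arguments are assumed placeholders supplied by the caller, not names from the repo:

```python
# Illustrative training loop matching the table above; `model` and
# `train_batches` are assumptions, not the repo's actual training script.
import torch
import torch.nn.functional as F

def train(model: torch.nn.Module, train_batches, steps: int = 25_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=0.1)
    for step in range(steps):
        x, y = next(train_batches)  # token-id tensors, each of shape (batch, 256)
        logits = model(x)           # (batch, 256, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        if step % 500 == 0:
            pass  # run validation here (every 500 steps, per the table)
```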

Results

Split        Loss
-----------  ----
Train        2.08
Validation   1.77

Paper

Gemma 3 Technical Report. Gemma Team, Google DeepMind, 2025.