DDP

Overview

Experiments with PyTorch DistributedDataParallel (DDP), launched via torchrun, for multi-GPU training. The base model is a small Llama variant. The focus is on the DDP training loop: process-group initialisation, gradient synchronisation across ranks, and checkpoint management.
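
A minimal sketch of the setup step, assuming a torchrun launch, the NCCL backend, and one GPU per process; the helper names are illustrative, not the repo's actual API:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def wrap_model(model, local_rank):
    # DDP registers autograd hooks that all-reduce gradients during
    # backward(), keeping parameters identical on every rank.
    return DDP(model.to(local_rank), device_ids=[local_rank])

def cleanup_ddp():
    dist.destroy_process_group()
```

A two-GPU run would then be launched with something like `torchrun --nproc_per_node=2 train.py`.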

Architecture (base model)

  • Llama-style decoder-only transformer
  • 6 attention heads, 6 decoder layers, 384-dimensional embeddings, 2 KV heads (grouped-query attention)
  • 128-token block size
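
For reference, the architecture above maps onto a configuration roughly like the following (a sketch only; the field names are illustrative and the vocabulary size is not specified here):

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    n_layers: int = 6       # decoder layers
    n_heads: int = 6        # query (attention) heads
    n_kv_heads: int = 2     # shared KV heads (grouped-query attention)
    dim: int = 384          # embedding / residual-stream width
    block_size: int = 128   # maximum context length in tokens
```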

Training

Hyperparameter   Value
Dataset          TinyShakespeare
Iterations       8,000 (validation every 100)
Optimizer        AdamW, lr = 1e-4
Batch size       64
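
Putting the pieces together, a sketch of one possible training loop under these hyperparameters; `get_batch` and the model's `(logits, loss)` return value are assumptions for illustration, not the repo's actual interface:

```python
import torch
import torch.distributed as dist

def train(model, get_batch, local_rank, max_iters=8_000, eval_interval=100):
    # model is the DDP-wrapped module; get_batch(split) is assumed to return
    # an (inputs, targets) pair of 64 sequences of 128 tokens each.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(max_iters):
        x, y = get_batch("train")
        x, y = x.to(local_rank), y.to(local_rank)
        _, loss = model(x, y)               # assumes forward returns (logits, loss)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()                     # DDP all-reduces gradients across ranks here
        optimizer.step()

        if step % eval_interval == 0 and dist.get_rank() == 0:
            # Periodic validation / checkpointing on rank 0 only;
            # model.module unwraps the DDP container before saving.
            torch.save(model.module.state_dict(), "checkpoint.pt")
```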

Results

Split        Loss
Train        1.5
Validation   1.1