# Differential Transformer

## Overview
From-scratch replication of the Differential Transformer. Standard softmax attention is replaced with differential attention: two attention score maps are computed in parallel and their difference is taken, cancelling shared attention noise and producing sharper, more focused weights on relevant tokens. Based on Differential Transformer (Ye et al., 2024).
## Architecture
- Differential attention: two Q/K projections per head; output = (softmax(Q₁K₁ᵀ/√d) − λ·softmax(Q₂K₂ᵀ/√d))·V (see the sketch after this list)
- Scalar λ per layer, initialised small and learned
- Otherwise standard decoder-only transformer (RMSNorm, SwiGLU)
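A minimal single-head sketch of the mechanism described above, assuming the learned scalar λ per layer from the list; the class and parameter names are illustrative, and the paper's λ re-parameterisation and per-head normalisation are omitted here. Subtracting the second map cancels score mass that both softmax maps assign, leaving the attention that only the first map places on relevant tokens.

```python
# Illustrative single-head differential attention (not the repo's exact code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.1):
        super().__init__()
        # Two Q/K projections per head, one V projection.
        self.q1 = nn.Linear(d_model, d_head, bias=False)
        self.q2 = nn.Linear(d_model, d_head, bias=False)
        self.k1 = nn.Linear(d_model, d_head, bias=False)
        self.k2 = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)
        # Scalar lambda per layer, initialised small and learned.
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )

        def attn_map(q, k):
            scores = (q @ k.transpose(-2, -1)) * self.scale
            scores = scores.masked_fill(causal_mask, float("-inf"))
            return F.softmax(scores, dim=-1)

        a1 = attn_map(self.q1(x), self.k1(x))
        a2 = attn_map(self.q2(x), self.k2(x))
        # Differential attention: subtract the second map to cancel common noise.
        return (a1 - self.lam * a2) @ self.v(x)
```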
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyShakespeare |
| Steps | 2,000 (validation every 100 steps) |
| Hardware | A100 GPU |
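A purely illustrative training-loop skeleton matching the schedule above (2,000 steps, validation every 100). The tiny model, random-token `get_batch`, batch size, and learning rate are stand-ins so the loop runs end to end; the repo's actual model, TinyShakespeare loader, and optimizer settings may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 256, 64, 16               # assumed sizes
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed learning rate

def get_batch(split: str):
    # Stand-in for the TinyShakespeare loader: random next-token pairs.
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    return x, torch.roll(x, shifts=-1, dims=1)

for step in range(1, 2001):
    xb, yb = get_batch("train")
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:                                      # validate every 100 steps
        with torch.no_grad():
            xv, yv = get_batch("val")
            val_loss = F.cross_entropy(model(xv).view(-1, vocab_size), yv.view(-1))
        print(f"step {step}: train {loss.item():.3f}  val {val_loss.item():.3f}")
```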
## Results
| Split | Loss |
|---|---|
| Train | 5.95 |
| Validation | 5.98 |
## Paper

Differential Transformer, Ye et al., Microsoft Research, 2024 (arXiv:2410.05258)