# Differential Transformer

## Overview
From-scratch replication of the Differential Transformer. Standard softmax attention is replaced with differential attention: two attention score maps are computed in parallel and their difference is taken, cancelling shared attention noise and producing sharper, more focused weights on relevant tokens. Based on Differential Transformer (Ye et al., 2024).
## Architecture
- Differential attention: two Q/K projections per head; output = (softmax(Q₁K₁ᵀ/√d) − λ·softmax(Q₂K₂ᵀ/√d))·V (see the sketch after this list)
- Scalar λ per layer, initialised small and learned
- Otherwise standard decoder-only transformer (RMSNorm, SwiGLU)
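A minimal single-head sketch of the mechanism described above, assuming the learned scalar λ per layer from the list; the class and parameter names are illustrative, and the paper's λ re-parameterisation and per-head normalisation are omitted here. Subtracting the second map cancels score mass that both softmax maps assign, leaving the attention that only the first map places on relevant tokens.

```python
# Illustrative single-head differential attention (not the repo's exact code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.1):
        super().__init__()
        # Two Q/K projections per head, one V projection.
        self.q1 = nn.Linear(d_model, d_head, bias=False)
        self.q2 = nn.Linear(d_model, d_head, bias=False)
        self.k1 = nn.Linear(d_model, d_head, bias=False)
        self.k2 = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)
        # Scalar lambda per layer, initialised small and learned.
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )

        def attn_map(q, k):
            scores = (q @ k.transpose(-2, -1)) * self.scale
            scores = scores.masked_fill(causal_mask, float("-inf"))
            return F.softmax(scores, dim=-1)

        a1 = attn_map(self.q1(x), self.k1(x))
        a2 = attn_map(self.q2(x), self.k2(x))
        # Differential attention: subtract the second map to cancel common noise.
        return (a1 - self.lam * a2) @ self.v(x)
```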
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyShakespeare |
| Steps | 2,000 (validation every 100 steps) |
| Hardware | A100 GPU |
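A purely illustrative training-loop skeleton matching the schedule above (2,000 steps, validation every 100). The tiny model, random-token `get_batch`, batch size, and learning rate are stand-ins so the loop runs end to end; the repo's actual model, TinyShakespeare loader, and optimizer settings may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 256, 64, 16               # assumed sizes
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed learning rate

def get_batch(split: str):
    # Stand-in for the TinyShakespeare loader: random next-token pairs.
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    return x, torch.roll(x, shifts=-1, dims=1)

for step in range(1, 2001):
    xb, yb = get_batch("train")
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:                                      # validate every 100 steps
        with torch.no_grad():
            xv, yv = get_batch("val")
            val_loss = F.cross_entropy(model(xv).view(-1, vocab_size), yv.view(-1))
        print(f"step {step}: train {loss.item():.3f}  val {val_loss.item():.3f}")
```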
## Results
| Split | Loss |
|---|---|
| Train | 5.95 |
| Validation | 5.98 |
## Paper

Differential Transformer, Ye et al., Microsoft Research, 2024 (arXiv:2410.05258)