# DeepSeekV3

## Overview
From-scratch replication of DeepSeek-V3. It reproduces two of the paper's key innovations: Multi-head Latent Attention (MLA), which compresses the KV cache through a low-rank bottleneck, and auxiliary-loss-free load balancing, which avoids the gradient interference caused by explicit balancing losses. A multi-token prediction (MTP) head is also implemented. Based on the DeepSeek-V3 Technical Report (DeepSeek-AI, 2024).
## Architecture
- MoE: 16 experts, top-4 routing, plus 1 shared expert
- Load balancing: auxiliary-loss-free, bias-based routing (see the routing sketch below)
- Attention: MLA with a 64-dim KV latent and 8 heads (see the MLA sketch below)
- MTP: 1 auxiliary multi-token prediction head
- Config: 512-dim hidden size, 8 decoder layers, 256-token context
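
To make the MLA bullet concrete, here is a minimal PyTorch sketch of the latent KV path under the config above (512-dim model, 8 heads, 64-dim KV latent). All names are hypothetical, this is not the repo's code, and the paper's decoupled RoPE branch is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    """Multi-head Latent Attention, KV path only (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, kv_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.w_dkv = nn.Linear(d_model, kv_latent, bias=False)  # down-projection; its output is all the cache stores
        self.w_uk = nn.Linear(kv_latent, d_model, bias=False)   # up-projection to per-head keys
        self.w_uv = nn.Linear(kv_latent, d_model, bias=False)   # up-projection to per-head values
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        c_kv = self.w_dkv(x)  # (b, t, 64): the low-rank bottleneck
        def split(z):  # (b, t, d) -> (b, heads, t, head_dim)
            return z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q = split(self.w_q(x))
        k = split(self.w_uk(c_kv))
        v = split(self.w_uv(c_kv))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(out.transpose(1, 2).reshape(b, t, d))
```

Caching the 64-dim `c_kv` per token instead of full keys and values (2 × 512 floats per token at this config) is what yields the 16x cache reduction.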
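The load-balancing bullet refers to the paper's auxiliary-loss-free strategy: each expert carries a bias that is added to its affinity score only when choosing the top-k experts, not when computing the gating weights, and the bias is nudged toward balance outside of backprop. A minimal sketch follows; the update speed `bias_lr` and all names are assumptions.

```python
import torch
import torch.nn as nn

class AuxFreeRouter(nn.Module):
    """Top-k MoE router balanced by a bias term instead of an auxiliary loss (sketch)."""
    def __init__(self, d_model=512, n_experts=16, top_k=4, bias_lr=1e-3):
        super().__init__()
        self.top_k = top_k
        self.bias_lr = bias_lr  # bias update speed (assumed value)
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # per-expert bias: a buffer, not a parameter, so it never receives gradients
        self.register_buffer("expert_bias", torch.zeros(n_experts))

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = torch.sigmoid(self.gate(x))  # per-expert affinities
        # bias influences only *which* experts are picked ...
        _, idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        # ... while the gating weights come from the unbiased scores
        weights = torch.gather(scores, -1, idx)
        weights = weights / weights.sum(-1, keepdim=True)
        if self.training:
            with torch.no_grad():
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, idx.flatten(),
                                  torch.ones(idx.numel(), device=idx.device))
                # lower the bias of overloaded experts, raise it for underloaded ones
                self.expert_bias -= self.bias_lr * torch.sign(load - load.mean())
        return idx, weights
```

Because balancing happens through the bias rather than through a loss term, it adds no gradient that competes with the language-modeling objective, which is the interference the Overview refers to.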
## Training
| Hyperparameter | Value |
|---|---|
| Dataset | TinyStories (~300M training tokens, reached via gradient accumulation) |
| Optimizer | AdamW, lr = 6e-4 |
| Batch size | 32 |
| Loss scale | 0.3 |
| Hardware | Kaggle P100 |
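
As a rough sketch of how these settings could fit together in one training step, assuming the loss scale of 0.3 weights the MTP head's loss against the main next-token loss; `ACCUM_STEPS`, the batch layout, and the model returning both heads' logits are assumptions, not details from this repo.

```python
import torch
import torch.nn.functional as F

MTP_LOSS_SCALE = 0.3  # "loss scale" from the table, read here as the MTP loss weight
ACCUM_STEPS = 8       # assumed; the table only says gradient accumulation is used

def training_step(model, optimizer, batch, step):
    """One micro-batch: main next-token loss plus down-weighted MTP loss."""
    tokens, targets, mtp_targets = batch     # targets shifted by 1 and 2 tokens
    main_logits, mtp_logits = model(tokens)  # assumed to return both heads' logits
    loss = F.cross_entropy(main_logits.flatten(0, 1), targets.flatten())
    loss = loss + MTP_LOSS_SCALE * F.cross_entropy(
        mtp_logits.flatten(0, 1), mtp_targets.flatten())
    (loss / ACCUM_STEPS).backward()          # accumulate scaled gradients
    if (step + 1) % ACCUM_STEPS == 0:        # step only every ACCUM_STEPS micro-batches
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
```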
## Paper

DeepSeek-AI (2024). *DeepSeek-V3 Technical Report*. arXiv:2412.19437.