DeepSeekV3

Language Models · PyTorch · TinyStories
GitHub →

Overview

From-scratch replication of DeepSeek-V3 in PyTorch. It reproduces two key innovations: Multi-head Latent Attention (MLA), which shrinks the KV cache by projecting keys and values through a low-rank latent bottleneck, and auxiliary-loss-free load balancing, which steers expert routing with a learned per-expert bias instead of an explicit balancing loss whose gradients would interfere with the language-modeling objective. Multi-token prediction (MTP) is also implemented. Based on the DeepSeek-V3 Technical Report (DeepSeek-AI, 2024).
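To make the MLA idea concrete, here is a minimal sketch of the KV-compression path in PyTorch. Dimensions follow the config below (512-dim model, 8 heads, 64-dim latent); the class and variable names are illustrative, not the repo's actual API, and RoPE/decoupled-key details from the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    """Sketch of Multi-head Latent Attention's KV compression.

    Instead of caching full per-head keys/values, the layer caches one
    low-rank latent per token and up-projects it at attention time.
    """

    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress KV
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # decompress K
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # decompress V
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        latent = self.kv_down(x)  # (b, t, 64): this is all the cache stores
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, d))
```

With these sizes the cache holds 64 floats per token instead of the 2 × 512 = 1024 a standard MHA KV cache would need, a 16× reduction.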

Architecture

  • MoE: 16 experts, top-4 routing, 1 shared expert
  • Load balancing: Auxiliary-free (bias-based routing)
  • Attention: MLA with 64-dim KV latent, 8 heads
  • MTP: 1 auxiliary multi-token prediction head
  • Config: 512-dim, 8 decoder layers, 256-token context
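The bias-based routing from the list above can be sketched as follows. The key property is that the per-expert bias affects only which experts are *selected*; the gating weights come from the unbiased scores, so no balancing term perturbs the training gradients. Function names and the sigmoid/softmax details are illustrative, and the bias update follows the sign-based scheme described in the paper.

```python
import torch

def biased_topk_route(scores, bias, k=4):
    """Select top-k experts using biased scores, but weight them
    with the original (unbiased) scores."""
    _, idx = torch.topk(scores + bias, k, dim=-1)          # biased selection
    gates = torch.softmax(scores.gather(-1, idx), dim=-1)  # unbiased weights
    return idx, gates

def update_bias(bias, idx, n_experts, gamma=1e-3):
    """After each step, push the bias down for overloaded experts and
    up for underloaded ones (sign update, no gradient involved)."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    target = idx.numel() / n_experts  # perfectly balanced load
    return bias - gamma * torch.sign(load - target)
```

Usage: route a batch of token scores over the 16 routed experts, then refresh the bias from the observed expert loads before the next step (the shared expert bypasses routing entirely).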

Training

  • Dataset: TinyStories (~300M tokens via gradient accumulation)
  • Optimizer: AdamW, lr=6e-4
  • Batch size: 32, loss scale=0.3
  • Hardware: Kaggle P100
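One plausible reading of the hyperparameters above, sketched as a training step: the batches are accumulated before each optimizer update, and the 0.3 loss scale weights the auxiliary MTP loss (an assumption; the table does not say which loss it scales). The model, losses, and accumulation count here are dummies.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the 8-layer decoder
opt = torch.optim.AdamW(model.parameters(), lr=6e-4)  # as in the table
accum_steps = 4   # illustrative; the real count depends on the token budget
mtp_weight = 0.3  # assumed meaning of "loss scale=0.3"

opt.zero_grad()
for step in range(accum_steps):
    x = torch.randn(32, 512)       # batch size 32, as in the table
    y = model(x)
    main_loss = y.pow(2).mean()    # stand-in for next-token cross-entropy
    mtp_loss = y.abs().mean()      # stand-in for the MTP head's loss
    loss = (main_loss + mtp_weight * mtp_loss) / accum_steps
    loss.backward()                # gradients accumulate across micro-batches
opt.step()
```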

Paper

DeepSeek-V3 Technical Report — DeepSeek-AI, 2024