Kimi-K2

Category: Language Models
Framework: PyTorch
Dataset: Custom
Created: August 06, 2025

Overview

A from-scratch implementation of Kimi-K2

Technical Details

  • Framework: PyTorch
  • Dataset: Custom
  • Category: Language Models

Implementation Details

A PyTorch reimplementation of a DeepSeek V3-inspired transformer model with Mixture of Experts (MoE), Latent Attention, and other advanced features.

🔗 View the StoryKimi Model

📊 Training Results & Model Weights

📈 View Training Report: StoryKimi Training Results on WandB

💾 Download Pre-trained Weights:

  • Hugging Face Model: YuvrajSingh9886/StoryKimi
  • WandB Checkpoints: Check the WandB report above for additional trained model checkpoints

Features

  • Latent Attention: Efficient attention mechanism with compressed key-value representations
  • Mixture of Experts (MoE): 8 experts with top-2 routing and shared expert support (see the sketch after this list)
  • SwiGLU Activation: Advanced activation function in expert layers
  • Sinusoidal Positional Embeddings: Position encoding for sequence understanding
  • Liger Kernels: Optimized kernels for faster training (optional)
  • Distributed Training: Support for multi-GPU training with DDP
  • Advanced Optimizer: Muon optimizer with auxiliary Adam for better convergence
  • Gradio Interface: Interactive web interface for text generation
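
The MoE layer is the most involved of these features. Below is a minimal sketch of top-2 routing with an always-active shared expert; the module names, hidden width, and softmax-based router are illustrative assumptions, not necessarily what model.py does.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One feed-forward expert with a SwiGLU gate (names and hidden width are illustrative)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Top2MoE(nn.Module):
    """Routes each token to its top-2 experts and always adds a shared expert."""
    def __init__(self, dim: int = 384, hidden: int = 1536, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([SwiGLUExpert(dim, hidden) for _ in range(n_experts)])
        self.shared_expert = SwiGLUExpert(dim, hidden)
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (batch, seq, dim)
        scores = F.softmax(self.router(x), dim=-1)           # routing probabilities
        top_w, top_idx = scores.topk(self.top_k, dim=-1)     # (batch, seq, top_k)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)      # renormalize the kept weights
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (top_idx == e)                             # slots where expert e was chosen
            mask = sel.any(dim=-1)                           # tokens routed to expert e
            if mask.any():
                weight = (top_w * sel).sum(dim=-1)[mask].unsqueeze(-1)
                routed[mask] = routed[mask] + weight * expert(x[mask])
        return self.shared_expert(x) + routed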

Model Architecture

Default Configuration

  • Embedding Dimensions: 384
  • Decoder Layers: 6
  • Attention Heads: 8
  • MoE Experts: 8 (top-2 routing)
  • Block Size: 128 tokens
  • Vocabulary Size: Based on Llama-2-7b tokenizer (~32,000 tokens)
  • Latent Dimension: 64 (for compressed attention)
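
As a rough sketch, the compressed key-value ("latent") attention with latent_dim=64 can look like the following, loosely in the spirit of DeepSeek-style multi-head latent attention; the exact projections, normalization, and rotary handling in model.py may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttention(nn.Module):
    """Simplified latent attention: keys/values are reconstructed from a small
    compressed latent vector instead of being projected at full width."""
    def __init__(self, dim: int = 384, n_heads: int = 8, latent_dim: int = 64):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.kv_down = nn.Linear(dim, latent_dim, bias=False)   # compress to the latent
        self.k_up = nn.Linear(latent_dim, dim, bias=False)      # expand latent -> keys
        self.v_up = nn.Linear(latent_dim, dim, bias=False)      # expand latent -> values
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                # x: (batch, seq, dim)
        b, t, d = x.shape
        latent = self.kv_down(x)                         # what a KV cache would store
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal self-attention
        return self.out_proj(y.transpose(1, 2).reshape(b, t, d))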

Full Parameter List

Model Architecture Parameters

  • --block_size: Maximum sequence length (default: 128)
  • --batch_size: Training batch size (default: 256)
  • --embeddings_dims: Model embedding dimensions (default: 384)
  • --no_of_heads: Number of attention heads (default: 8)
  • --no_of_decoder_layers: Number of decoder layers (default: 6)
  • --latent_dim: Latent dimension for attention (default: 64)

Mixture of Experts (MoE) Parameters

  • --experts: Number of MoE experts (default: 8)
  • --top_experts: Number of experts to route to (default: 2)
  • --use_shared_expert: Enable shared expert in MoE (default: True)
  • --noisy_topk: Use noisy top-k routing (default: False)
  • --useauxFreeLoadBalancingLoss: Use auxiliary-loss-free load balancing (default: True; see the sketch after this list)
  • --aux_free_bias_update_rate: Bias update rate for load balancing (default: 0.001)
  • --loss_scale: Loss scaling factor (default: 0.3)
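
A hedged sketch of how the auxiliary-loss-free balancing behind --useauxFreeLoadBalancingLoss and --aux_free_bias_update_rate is commonly implemented (following the DeepSeek-V3 recipe; the actual update in this repo may differ):

import torch

@torch.no_grad()
def update_routing_bias(bias, top_idx, n_experts, update_rate=0.001):
    """Nudge per-expert selection biases toward a uniform load.

    bias:    (n_experts,) tensor added to routing scores before top-k selection.
    top_idx: (num_tokens, top_k) expert indices chosen this step.
    """
    load = torch.bincount(top_idx.flatten(), minlength=n_experts).float()
    target = load.mean()                        # ideal tokens-per-expert under uniform routing
    # Overloaded experts get their bias lowered, underloaded experts get it raised.
    bias += update_rate * torch.sign(target - load)
    return bias

During routing, expert selection would use scores + bias while the mixing weights still use the raw scores; keeping the bias out of the weights is what makes the balancing auxiliary-loss-free.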

Training Hyperparameters

  • --epochs: Number of training epochs (default: 1)
  • --max_lr: Maximum learning rate (default: 6e-4)
  • --weight_decay_optim: Weight decay for optimizer (default: 0.1)
  • --beta_1: Beta1 for optimizer (default: 0.9)
  • --beta_2: Beta2 for optimizer (default: 0.95)
  • --eps: Epsilon for optimizer (default: 1e-8)
  • --clip: Gradient clipping value (default: 1.0)

Regularization Parameters

  • --dropout: Dropout rate (default: 0.1)
  • --attn_dropout: Attention dropout rate (default: 0.1)

System Configuration

  • --device: Device to use (default: 'cuda')
  • --use_checkpointing: Use gradient checkpointing (default: False)
  • --use_liger: Use Liger kernels for optimization (default: True)
  • --ignore_pad_token_in_loss: Ignore padding tokens in loss calculation (default: True)
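
As an illustration of the last flag, ignoring padding in the loss usually reduces to passing the pad id as ignore_index; a minimal sketch, with pad_token_id assumed to come from the tokenizer:

import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, targets: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Cross-entropy over the vocabulary, skipping padded positions entirely."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # (batch*seq, vocab_size)
        targets.view(-1),                  # (batch*seq,)
        ignore_index=pad_token_id,         # pad tokens contribute nothing to loss or gradient
    )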

Data Configuration

  • --vocab_size: Vocabulary size (default: 32000, updated based on tokenizer)
  • --base_freq: Base frequency for the sinusoidal positional encoding (default: 100000; see the sketch after this list)
  • --hf_token: Hugging Face token for accessing gated models like Llama-2 (default: None)
  • --dataset: Dataset to use ('tinystories', 'fineweb', 'tinyshakespeare') (default: 'tinystories')
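
A standard formulation of sinusoidal positional embeddings parameterized by the base frequency above (illustrative; the exact code in model.py may differ):

import torch

def sinusoidal_embeddings(seq_len: int, dim: int, base_freq: float = 100000.0) -> torch.Tensor:
    """Classic sin/cos positional encodings of shape (seq_len, dim); dim is assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)            # (seq_len, 1)
    inv_freq = base_freq ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * inv_freq                                                  # (seq_len, dim/2)
    emb = torch.zeros(seq_len, dim)
    emb[:, 0::2] = torch.sin(angles)
    emb[:, 1::2] = torch.cos(angles)
    return emb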

Generation Parameters

  • --generation_max_length: Maximum length for text generation (default: 50)
  • --generation_top_k: Top-k value for sampling (default: 50)
  • --generation_temperature: Temperature for sampling (default: 1.0)

Logging and Checkpointing

  • --log_interval: Steps between logging (default: 100)
  • --save_interval: Steps between saving checkpoints (default: 2000)
  • --eval_interval: Steps between evaluation (default: 400)
  • --eval_iters: Number of iterations for evaluation (default: 400)
  • --warmup_iters: Number of warmup iterations (default: 400)
  • --total_iters: Total training iterations (default: 10000)
  • --lr_decay_iters: Learning rate decay iterations (default: 10000)
  • --wandb_project: Wandb project name (default: 'storykimi')
  • --wandb_run_name: Wandb run name (default: None)

Batch Size Configuration

  • --total_batch_size: Total batch size for gradient accumulation (default: 524288; see the arithmetic sketch below)
  • --micro_batch_size: Micro batch size (default: same as --batch_size)
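
A sketch of how these two flags typically translate into gradient-accumulation steps, under the assumption (not stated in the repo) that --total_batch_size is counted in tokens:

# Assumed relationship between the batch-size flags, using the defaults above:
total_batch_size = 524_288          # --total_batch_size, in tokens (assumption)
micro_batch_size = 256              # --micro_batch_size (defaults to --batch_size)
block_size = 128                    # --block_size, tokens per sequence
world_size = 1                      # number of GPUs under DDP

tokens_per_step = micro_batch_size * block_size * world_size   # 256 * 128 * 1 = 32,768
grad_accum_steps = total_batch_size // tokens_per_step         # 524,288 // 32,768 = 16
print(grad_accum_steps)  # 16 micro-steps accumulated before each optimizer update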

Distributed Training

  • --use_ddp: Use distributed data parallel (default: False)
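
Under torchrun, the DDP path usually boils down to the standard recipe below (a sketch of the common pattern, not necessarily the exact code in trainer.py):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    """Initialize the process group and wrap the model for multi-GPU training."""
    dist.init_process_group(backend="nccl")      # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank])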

Quick Start

Installation

chmod +x install.sh
./install.sh

Using Pre-trained Weights

  1. Download Model Weights: grab a checkpoint from the Hugging Face model page (YuvrajSingh9886/StoryKimi) or from the WandB report linked above.
  2. Load Pre-trained Model for Inference:
    # Using the Gradio web interface
    python gradio/app.py --hf_token "your_token_here"
       
    # Or use in your own code
    python inference.py --checkpoint_path checkpoints/your_checkpoint.pt
       
    # Using Hugging Face transformers (if available)
    from transformers import AutoModel, AutoTokenizer
    model = AutoModel.from_pretrained("YuvrajSingh9886/StoryKimi")
    

Important: Hugging Face Token Setup

Since this model uses the Llama-2 tokenizer, you’ll need a Hugging Face token to access the gated model.

  1. Get a Hugging Face Token: create one at https://huggingface.co/settings/tokens and accept the Llama-2 license at https://huggingface.co/meta-llama/Llama-2-7b-hf.
  2. Set your token in one of these ways:
    # Option 1: Environment variable (recommended)
    export HF_TOKEN="your_token_here"
       
    # Option 2: Pass as command line argument
    python trainer.py --hf_token "your_token_here"
    
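
In Python, the token is typically used like this when the Llama-2 tokenizer is loaded (a sketch; tokenizer.py may wire it up differently):

import os
from transformers import AutoTokenizer

# The token grants access to the gated meta-llama/Llama-2-7b-hf repository.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    token=os.environ.get("HF_TOKEN"),   # or pass the value of --hf_token
)
print(tokenizer.vocab_size)             # ~32,000, which informs --vocab_size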

Training Examples

Basic Training (Single GPU)

# With environment variable
export HF_TOKEN="your_token_here"
python trainer.py

# With command line argument
python trainer.py --hf_token "your_token_here"

Training with Custom Parameters

# Train with larger model
python trainer.py --hf_token "your_token_here" --embeddings_dims 512 --no_of_heads 16 --no_of_decoder_layers 8

# Train with different dataset
python trainer.py --hf_token "your_token_here" --dataset fineweb --epochs 3

# Train with custom learning rate and batch size
python trainer.py --hf_token "your_token_here" --max_lr 1e-3 --batch_size 128 --block_size 256

# Train with more experts
python trainer.py --hf_token "your_token_here" --experts 16 --top_experts 4

# Train without shared expert
python trainer.py --hf_token "your_token_here" --use_shared_expert False

# Train with noisy top-k routing
python trainer.py --hf_token "your_token_here" --noisy_topk True

Multi-GPU Distributed Training

# Set token as environment variable for distributed training
export HF_TOKEN="your_token_here"

# 2 GPUs
torchrun --nproc_per_node=2 trainer.py

# 4 GPUs with custom parameters
torchrun --nproc_per_node=4 trainer.py --batch_size 128 --embeddings_dims 512

# 8 GPUs with large model configuration
torchrun --nproc_per_node=8 trainer.py \
    --embeddings_dims 768 \
    --no_of_heads 12 \
    --no_of_decoder_layers 12 \
    --experts 16 \
    --top_experts 4 \
    --batch_size 64 \
    --block_size 512

Advanced Training Configurations

High-Performance Setup

export HF_TOKEN="your_token_here"
python trainer.py \
    --embeddings_dims 768 \
    --no_of_heads 12 \
    --no_of_decoder_layers 12 \
    --experts 16 \
    --top_experts 4 \
    --batch_size 32 \
    --block_size 512 \
    --max_lr 3e-4 \
    --epochs 5 \
    --use_liger True \
    --wandb_project "storykimi-large"

Experimental Setup

export HF_TOKEN="your_token_here"
python trainer.py \
    --noisy_topk True \
    --use_shared_expert False \
    --aux_free_bias_update_rate 0.01 \
    --loss_scale 0.5 \
    --dropout 0.2 \
    --attn_dropout 0.15 \
    --wandb_project "storykimi-experimental"

Memory-Efficient Setup

export HF_TOKEN="your_token_here"
python trainer.py \
    --use_checkpointing True \
    --batch_size 64 \
    --micro_batch_size 16 \
    --total_batch_size 262144 \
    --block_size 128

Inference with Gradio

# Set your HF token
export HF_TOKEN="your_token_here"

# Run the Gradio app
cd gradio
python app.py --hf_token "your_token_here"

# Or with environment variable
cd gradio
python app.py

# With custom port and public sharing
cd gradio
python app.py --hf_token "your_token_here" --port 8080 --share

Help and Parameter Information

# View all available parameters
python trainer.py --help

# View Gradio app parameters
cd gradio
python app.py --help

Environment Variables

You can set the following environment variables instead of passing them as arguments:

# Hugging Face token (recommended approach)
export HF_TOKEN="your_token_here"

# Wandb API key (optional, for experiment tracking)
export WANDB_API_KEY="your_wandb_key_here"

File Structure

StoryKimi/
├── config.py           # Model configuration and hyperparameters with argparse
├── model.py            # Model architecture (DeepSeekV3, MoE, Attention, etc.)
├── tokenizer.py        # Tokenizer setup
├── data.py             # Data loading and preparation
├── inference.py        # Inference functions and text generation
├── trainer.py          # Main training loop with DDP support
├── install.sh          # Setup script
├── requirements.txt    # Python dependencies
├── gradio/
│   ├── app.py          # Gradio web interface
│   └── requirements.txt
└── generated_data/     # Generated text outputs

Training Features

  • Gradient Accumulation: Configurable batch size scaling
  • Learning Rate Scheduling: Cosine decay with warmup (see the sketch after this list)
  • Gradient Clipping: Prevents gradient explosion
  • Wandb Integration: Experiment tracking and logging
  • Checkpointing: Regular model checkpoints during training
  • Loss Calculation: Optimized cross-entropy with padding token handling
  • Distributed Training: Multi-GPU support with DDP
  • Memory Optimization: Gradient checkpointing support
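
The warmup-plus-cosine schedule driven by --max_lr, --warmup_iters, and --lr_decay_iters is typically implemented along these lines (a sketch; min_lr is an assumed floor, and the exact curve in trainer.py may differ):

import math

def get_lr(it: int, max_lr: float = 6e-4, warmup_iters: int = 400,
           lr_decay_iters: int = 10_000, min_lr: float = 6e-5) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters              # linear warmup
    if it > lr_decay_iters:
        return min_lr                                        # floor after decay finishes
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))    # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)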

Generation Methods

  1. Top-k Sampling: Traditional sampling with temperature control (sketched below)
  2. Beam Search: Deterministic search for high-quality outputs
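
A minimal sketch of method 1 with the --generation_* defaults (max length 50, top_k 50, temperature 1.0), assuming the model's forward returns logits of shape (batch, seq, vocab); the real topk_sampling in inference.py may differ:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_top_k(model, input_ids: torch.Tensor, max_new_tokens: int = 50,
                 top_k: int = 50, temperature: float = 1.0) -> torch.Tensor:
    """Autoregressive top-k sampling with temperature scaling."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)[:, -1, :] / temperature       # next-token logits
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
        probs = F.softmax(topk_vals, dim=-1)                     # distribution over the k best
        next_tok = topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))
        input_ids = torch.cat([input_ids, next_tok], dim=-1)
    return input_ids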

Advanced Usage

Configuration Files

All parameters can be set via command line arguments. For complex configurations, consider creating shell scripts:

#!/bin/bash
# large_model_config.sh
python trainer.py \
    --embeddings_dims 1024 \
    --no_of_heads 16 \
    --no_of_decoder_layers 24 \
    --experts 32 \
    --top_experts 8 \
    --batch_size 16 \
    --block_size 1024 \
    --max_lr 1e-4 \
    --epochs 10 \
    --use_liger True \
    --use_checkpointing True \
    --wandb_project "storykimi-large-scale"

Custom Dataset Training

# TinyStories (default)
python trainer.py --dataset tinystories

# FineWeb (large scale)
python trainer.py --dataset fineweb --epochs 3 --batch_size 64

# TinyShakespeare (character level)
python trainer.py --dataset tinyshakespeare --block_size 256

Monitoring and Logging

# Custom wandb configuration
python trainer.py \
    --wandb_project "my-experiment" \
    --wandb_run_name "test-run-1" \
    --log_interval 50 \
    --eval_interval 200 \
    --save_interval 1000

Hardware-Specific Optimizations

For High-Memory GPUs (A100, H100)

python trainer.py \
    --batch_size 512 \
    --block_size 2048 \
    --embeddings_dims 1024 \
    --total_batch_size 1048576

For Low-Memory GPUs (RTX 3080, 4080)

python trainer.py \
    --batch_size 32 \
    --micro_batch_size 8 \
    --block_size 128 \
    --use_checkpointing True \
    --embeddings_dims 256

Usage Examples

Basic Training

from trainer import train
train()

Text Generation

from inference import topk_sampling
from model import DeepSeekV3
from config import ModelArgs, get_args

# Load with custom config
args = get_args()
model_args = ModelArgs(args)
model = DeepSeekV3(device='cuda')
text = topk_sampling(model, "Once upon a time", device='cuda')

Loading a Trained Model

import torch
from model import DeepSeekV3
from config import ModelArgs, get_args

# Load saved model
args = get_args()
model_args = ModelArgs(args)
model = DeepSeekV3(device='cuda')
model.load_state_dict(torch.load('path/to/checkpoint.pt', map_location='cuda'))
model.eval()

Performance Tips

  1. Use Mixed Precision: Enable automatic mixed precision for faster training (see the sketch after this list)
  2. Gradient Checkpointing: Use --use_checkpointing True for memory-constrained setups
  3. Liger Kernels: Keep --use_liger True for optimized operations
  4. Batch Size Tuning: Start with smaller batch sizes and increase gradually
  5. Block Size: Larger block sizes improve quality but require more memory
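
Mixed precision is not exposed as a flag above, so here is the standard PyTorch fp16 pattern if you want to wire it in yourself (a sketch of one training step; model, optimizer, and scaler are assumed to exist):

import torch

def train_amp_step(model, optimizer, scaler, batch, targets, clip=1.0):
    """One mixed-precision training step (fp16 autocast + gradient scaling)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(batch)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()               # scale to avoid fp16 gradient underflow
    scaler.unscale_(optimizer)                  # unscale so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)   # matches --clip default
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

# scaler = torch.cuda.amp.GradScaler()  # create once, outside the training loop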

Troubleshooting

Common Issues

Authentication Error (401)

# Make sure you have accepted the Llama-2 license and have a valid token
# Visit: https://huggingface.co/meta-llama/Llama-2-7b-hf
# Then set your token:
export HF_TOKEN="your_token_here"

Out of Memory (OOM)

# Reduce batch size and enable checkpointing
python trainer.py --hf_token "your_token_here" --batch_size 16 --use_checkpointing True

# Use gradient accumulation
python trainer.py --hf_token "your_token_here" --batch_size 32 --micro_batch_size 8

Slow Training

# Enable Liger kernels and increase batch size
python trainer.py --hf_token "your_token_here" --use_liger True --batch_size 256

# Use multiple GPUs
export HF_TOKEN="your_token_here"
torchrun --nproc_per_node=4 trainer.py

NaN Loss

# Reduce learning rate and enable gradient clipping
python trainer.py --hf_token "your_token_here" --max_lr 1e-4 --clip 0.5

Contributing

Feel free to contribute improvements, bug fixes, or new features!

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers
  • Datasets
  • Gradio
  • Wandb
  • Liger-kernel (optional)
  • Muon optimizer

License

MIT License

Source Code

📁 GitHub Repository: Kimi-K2

View the complete implementation, training scripts, and documentation on GitHub.