Paper Replications
- ViT (Computer Vision): ViT-B/16 trained from scratch on a 3-class Food-101 subset. Train loss 1.20 / test loss 1.52.
- GPT (Language Models): Decoder-only transformer trained on TinyShakespeare, replicating the original OpenAI GPT architecture from scratch.
- BERT (Language Models): Bidirectional encoder pre-trained with masked language modelling on the Cornell Movie Dialogs corpus.
- CycleGANs (Generative Models): Cycle-consistent unpaired image translation on Cityscapes; two generators, two discriminators, and cycle + identity losses (loss sketch after this list).
- Differential Transformer (Language Models): Differential attention replicated from scratch; two softmax attention maps are subtracted to cancel noise (sketch after this list). Trained on TinyShakespeare on an A100.
- Encoder-Decoder (Sequential Models): LSTM-based Seq2Seq encoder-decoder for German→English translation. Train/val loss ~1.38 in 10 epochs.
- Fine-Tuning using PEFT (Fine-tuning): QLoRA fine-tuning scripts using PEFT + BitsAndBytes for both decoder- and encoder-type models.
- GRU (Sequential Models): GRU from scratch. 16 hidden units, 50 epochs. Train loss 0.51 / val loss 0.48.
- Attention Mechanisms (Attention): From-scratch implementations of Bahdanau and Luong attention in PyTorch.
- RNNs (Sequential Models): Vanilla RNN from scratch. 16 neurons, 50 epochs. Train loss 0.51 / val loss 0.50.
- Transformer (Language Models): Encoder-decoder transformer for English→Hindi translation on Samanantar (~25M params). Published on HuggingFace.
- Mixtral (Language Models): Sparse MoE transformer replicated from scratch on TinyShakespeare. Train loss 2.04 / val loss 2.09 in 1,000 steps on a T4.
- DPO (Fine-tuning): Direct Preference Optimization applied to Qwen0.5B-Instruct on UltraFeedback. Train loss 0.67 in 3,000 iterations (loss sketch after this list).
- SimplePO (Fine-tuning): Reference-free preference optimization (SimplePO) on OPT-330M. Batch size 128, lr=2e-5, beta=2 on UltraFeedback.
- LoRA (Fine-tuning): Low-rank adaptation implemented from scratch in PyTorch (sketch after this list). Train/val loss ~3.5 in 1,000 steps on an A100.
- ORPO (Fine-tuning): Odds Ratio Preference Optimization on OPT-330M. Reference-free alignment reaching train loss 1.70 in 3,000 iterations.
- Gemma (Language Models): Google's Gemma architecture replicated from scratch, with multi-query attention and GeGLU activations, on TinyShakespeare.
- Llama (Language Models): Decoder-only Llama replicated from scratch with RoPE, SwiGLU, RMSNorm and GQA (RoPE sketch after this list).
- CLIP (Vision-Language): Contrastive vision-language model trained on Flickr8K. Train loss 1.3 / val loss 2.2 in 30 epochs on a T4.
- DDP (Training Methods): Llama trained with PyTorch DistributedDataParallel via torchrun (launch sketch after this list). Val loss 1.1 in 8,000 iterations on TinyShakespeare.
- LLaVA (Vision-Language): Visual instruction tuning replicated from scratch on Flickr8K. Train loss 0.23 / val loss 0.22 in 5 epochs on a T4.
- Seq2Seq (Sequential Models): GRU-based Seq2Seq with both Bahdanau and Luong attention from scratch. 128 hidden units, 50 epochs.
- Whisper (Audio/Speech): Whisper ASR from scratch; a CNN over 80-channel mel spectrograms plus a 6-layer transformer decoder. Trained on GigaSpeech.
- LSTM (Sequential Models): LSTM from scratch (~128K params). 128 hidden units, 50 epochs. Train loss 0.49 / val loss 0.48.
- Gemma3 (Language Models): 90M-parameter Gemma 3 with local sliding-window attention (128-token blocks). Val loss 1.77 in 25k steps on TinyStories.
- Llama4 (Language Models): 1.2B-parameter MoE (32×12M experts, top-1 routing; router sketch after this list) trained on TinyStories. Val loss 1.70 in 20k steps on a Kaggle P100.
- Moonshine (Audio/Speech): Compact transformer ASR (288-dim, 6 heads) trained on GigaSpeech for 1,500 steps, with notes on overfitting at ~25 hours.
- PaliGemma (Vision-Language): Google's PaliGemma VLM (SigLIP + Gemma) replicated from scratch on Flickr8K.
- Pix2Pix (Generative Models): Conditional GAN for paired image-to-image translation (aerial→map) replicated from scratch, with a PatchGAN discriminator.
- SigLIP (Vision-Language): Sigmoid-loss vision-language pretraining replicated from scratch on Flickr8K; the pairwise sigmoid loss avoids global softmax normalisation (loss sketch after this list).
- TTS (Audio/Speech): Tacotron-style transformer TTS from scratch; 512-dim phoneme encoder and mel-spectrogram decoder at 16 kHz on GigaSpeech.
- VAE (Generative Models): VAE on CelebA (128×128); 4-layer conv encoder, 32-D latent, ConvTranspose decoder. Reconstruction + KL loss over 200 epochs (loss sketch after this list).
- WGANs (Generative Models): Wasserstein GAN and WGAN-GP implemented from scratch on MNIST, using a gradient penalty for stable training (penalty sketch after this list).
- Kimi-K2 (Language Models): DeepSeekV3-inspired MoE with latent attention, trained with the Muon optimizer. Pre-trained weights on HuggingFace.
- CGANs (Generative Models): Conditional GAN on MNIST; class-conditioned 64×64 digit generation. 30 epochs, BCE loss, TensorBoard logging.
- CLAP (Audio/Speech): Contrastive Language-Audio Pretraining from scratch on GigaSpeech. 768-D text and 2048-D audio embeddings projected into a 1024-D shared space.
- DCGANs (Generative Models): Deep Convolutional GAN trained on CelebA and CIFAR-10. ~7,800 steps (CelebA) and ~11,700 steps (CIFAR-10).
- DeepSeekV3 (Language Models): 16×4 MoE with Multi-head Latent Attention and auxiliary-loss-free load balancing, trained on TinyStories on a Kaggle P100.
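
Technique Sketches

A few of the entries above name a specific loss or mechanism; the sketches below reconstruct those in minimal PyTorch. They are illustrations under stated assumptions, not the repos' actual code, and all names, shapes, and hyperparameter values are placeholders.

For the CycleGANs entry, a sketch of the generator-side objective, assuming least-squares adversarial losses and L1 cycle/identity terms (the weights `lambda_cyc` and `lambda_id` are placeholders):

```python
import torch
import torch.nn.functional as F

def cycle_gan_generator_losses(G, F_gen, D_X, D_Y, real_x, real_y,
                               lambda_cyc=10.0, lambda_id=5.0):
    """Generator-side CycleGAN objective with G: X->Y and F_gen: Y->X."""
    fake_y, fake_x = G(real_x), F_gen(real_y)
    # adversarial (least-squares): each generator tries to fool its discriminator
    adv = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) \
        + F.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # cycle consistency: X -> Y -> X (and Y -> X -> Y) should reconstruct the input
    cyc = F.l1_loss(F_gen(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)
    # identity: a generator fed a target-domain image should change nothing
    idt = F.l1_loss(G(real_y), real_y) + F.l1_loss(F_gen(real_x), real_x)
    return adv + lambda_cyc * cyc + lambda_id * idt

# shape check with stand-in "networks"
net = lambda x: x
disc = lambda x: x.mean(dim=(1, 2, 3), keepdim=True)
x, y = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(cycle_gan_generator_losses(net, net, disc, disc, x, y).item())
```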
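For the Differential Transformer entry, a minimal single-head sketch of differential attention: two independently parameterised softmax attention maps are subtracted so that common-mode noise cancels. In the paper the subtraction weight lambda is learned per head; here it is a fixed placeholder:

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Two softmax attention maps over the same values; their weighted
    difference keeps the signal while cancelling shared noise."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

# toy shapes: batch=2, seq=8, head_dim=16
q1, k1, q2, k2, v = (torch.randn(2, 8, 16) for _ in range(5))
print(differential_attention(q1, k1, q2, k2, v).shape)  # torch.Size([2, 8, 16])
```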
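For the DPO entry, the loss is a logistic loss on the gap between policy-vs-reference log-ratios of the chosen and rejected completions. A minimal sketch, assuming the per-sequence log-probs have already been summed (beta=0.1 is a placeholder, not the run's setting):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * ((log pi_w - log ref_w) - (log pi_l - log ref_l)))."""
    pi_logratios = pi_chosen_logps - pi_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# toy usage on a batch of 4 preference pairs
pw, pl, rw, rl = (torch.randn(4) for _ in range(4))
print(dpo_loss(pw, pl, rw, rl).item())
```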
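For the LoRA entry, a from-scratch wrapper around nn.Linear: the pretrained weight is frozen and a trainable rank-r update B·A, scaled by alpha/r, is added on top. B starts at zero, so the wrapped layer is initially a no-op (r=8 and alpha=16 are placeholders, not the repo's settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update:
    y = base(x) + (alpha / r) * x A^T B^T."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```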
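For the Llama entry, the least obvious listed component is RoPE. A minimal sketch that rotates each (even, odd) channel pair of a query or key tensor by a position- and frequency-dependent angle:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotary position embeddings for x of shape (batch, seq, dim):
    channel pair i at position t is rotated by t * base^(-i / (dim/2))."""
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half) / half)      # (half,)
    angles = torch.arange(t)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # pair up channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                 # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 5, 8)
print(apply_rope(q).shape)  # torch.Size([1, 5, 8])
```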
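For the DDP entry, a minimal torchrun launch sketch with a toy model in place of Llama (assumes CUDA GPUs; torchrun supplies LOCAL_RANK and the rendezvous environment variables):

```python
# launch: torchrun --nproc_per_node=2 ddp_min.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")           # reads torchrun's env vars
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(3):                        # toy training steps
        x = torch.randn(8, 10, device=rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                       # DDP all-reduces gradients here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```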
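For the Llama4 entry, a minimal top-1 router: a linear layer scores the experts per token, each token runs through its best expert only, and the output is scaled by the router probability (toy sizes, nothing like the 32×12M configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Sparse MoE layer with top-1 routing over small MLP experts."""
    def __init__(self, dim=64, n_experts=4, hidden=128):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)
        gate, idx = probs.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                 # route only this expert's tokens
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```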
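For the SigLIP entry, the pairwise sigmoid loss: every image-text pair in the batch is an independent binary problem (+1 on the diagonal, -1 elsewhere), so no batch-wide softmax normalisation is needed. The paper learns the temperature t and bias b; they are fixed placeholders here:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T * t + b                  # (n, n) similarity logits
    labels = 2 * torch.eye(logits.size(0)) - 1    # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

print(siglip_loss(torch.randn(8, 32), torch.randn(8, 32)).item())
```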
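For the VAE entry, the objective and the reparameterization trick: pixel reconstruction plus the closed-form KL divergence from N(mu, sigma^2) to the standard normal prior:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Per-sample reconstruction loss plus closed-form KL to N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)  # 32-D latent, as in the entry
z = reparameterize(mu, logvar)
x = torch.rand(4, 3, 128, 128)
print(vae_loss(x, x.clone(), mu, logvar).item())     # both terms are 0 here
```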
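For the WGANs entry, the gradient penalty: the critic's gradient norm at random interpolates of real and fake samples is pushed toward 1 (gp_weight=10 is the WGAN-GP paper's default):

```python
import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    """WGAN-GP term: ((||grad critic(x_hat)|| - 1)^2) at interpolates x_hat."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(
        outputs=scores, inputs=mixed,
        grad_outputs=torch.ones_like(scores), create_graph=True,
    )[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1) ** 2).mean()

# toy critic on MNIST-shaped inputs
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 1))
real, fake = torch.rand(4, 1, 28, 28), torch.rand(4, 1, 28, 28)
print(gradient_penalty(critic, real, fake).item())
```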