Paper Replications
- ViT (Computer Vision): ViT-B/16 trained from scratch on a 3-class Food-101 subset. Train loss 1.20 / test loss 1.52.
- GPT (Language Models): Decoder-only transformer trained on TinyShakespeare, replicating the original OpenAI GPT architecture from scratch.
- BERT (Language Models): Bidirectional encoder pre-trained with masked language modelling on the Cornell Movie Dialogs corpus.
- CycleGANs (Generative Models): Cycle-consistent unpaired image translation on Cityscapes; two generators, two discriminators, and cycle + identity losses (loss sketch after this list).
- Differential Transformer (Language Models): Differential attention replicated from scratch; two softmax attention maps are subtracted to cancel noise (sketch after this list). Trained on TinyShakespeare on an A100.
- Encoder-Decoder (Sequential Models): LSTM-based Seq2Seq encoder-decoder for German→English translation. Train/val loss ~1.38 in 10 epochs.
- Fine-Tuning using PEFT (Fine-tuning): QLoRA fine-tuning scripts using PEFT + BitsAndBytes for both decoder- and encoder-type models.
- GRU (Sequential Models): GRU from scratch. 16 hidden units, 50 epochs. Train loss 0.51 / val loss 0.48.
- Attention Mechanisms (Attention): From-scratch implementations of Bahdanau and Luong attention in PyTorch.
- RNNs (Sequential Models): Vanilla RNN from scratch. 16 neurons, 50 epochs. Train loss 0.51 / val loss 0.50.
- Transformer (Language Models): Encoder-decoder transformer for English→Hindi translation on Samanantar (~25M params). Published on HuggingFace.
- Mixtral (Language Models): Sparse MoE transformer replicated from scratch on TinyShakespeare. Train loss 2.04 / val loss 2.09 in 1,000 steps on a T4.
- DPO (Fine-tuning): Direct Preference Optimization applied to Qwen0.5B-Instruct on UltraFeedback. Train loss 0.67 in 3,000 iterations (loss sketch after this list).
- SimplePO (Fine-tuning): Reference-free preference optimization (SimplePO) on OPT-330M. Batch size 128, lr=2e-5, beta=2 on UltraFeedback.
- LoRA (Fine-tuning): Low-rank adaptation implemented from scratch in PyTorch (sketch after this list). Train/val loss ~3.5 in 1,000 steps on an A100.
- ORPO (Fine-tuning): Odds Ratio Preference Optimization on OPT-330M. Reference-free alignment reaching train loss 1.70 in 3,000 iterations.
- Gemma (Language Models): Google's Gemma architecture replicated from scratch, with multi-query attention and GeGLU activations, on TinyShakespeare.
- Llama (Language Models): Decoder-only Llama replicated from scratch with RoPE, SwiGLU, RMSNorm and GQA (RoPE sketch after this list).
- CLIP (Vision-Language): Contrastive vision-language model trained on Flickr8K. Train loss 1.3 / val loss 2.2 in 30 epochs on a T4.
- DDP (Training Methods): Llama trained with PyTorch DistributedDataParallel via torchrun (launch sketch after this list). Val loss 1.1 in 8,000 iterations on TinyShakespeare.
- LLaVA (Vision-Language): Visual instruction tuning replicated from scratch on Flickr8K. Train loss 0.23 / val loss 0.22 in 5 epochs on a T4.
- Seq2Seq (Sequential Models): GRU-based Seq2Seq with both Bahdanau and Luong attention from scratch. 128 hidden units, 50 epochs.
- Whisper (Audio/Speech): Whisper ASR from scratch; a CNN over 80-channel mel spectrograms plus a 6-layer transformer decoder. Trained on GigaSpeech.
- LSTM (Sequential Models): LSTM from scratch (~128K params). 128 hidden units, 50 epochs. Train loss 0.49 / val loss 0.48.
- Gemma3 (Language Models): 90M-parameter Gemma 3 with local sliding-window attention (128-token blocks). Val loss 1.77 in 25k steps on TinyStories.
- Llama4 (Language Models): 1.2B-parameter MoE (32×12M experts, top-1 routing; router sketch after this list) trained on TinyStories. Val loss 1.70 in 20k steps on a Kaggle P100.
- Moonshine (Audio/Speech): Compact transformer ASR (288-dim, 6 heads) trained on GigaSpeech for 1,500 steps, with notes on overfitting at ~25 hours.
- PaliGemma (Vision-Language): Google's PaliGemma VLM (SigLIP + Gemma) replicated from scratch on Flickr8K.
- Pix2Pix (Generative Models): Conditional GAN for paired image-to-image translation (aerial→map) replicated from scratch, with a PatchGAN discriminator.
- SigLIP (Vision-Language): Sigmoid-loss vision-language pretraining replicated from scratch on Flickr8K; the pairwise sigmoid loss avoids global softmax normalisation (loss sketch after this list).
- TTS (Audio/Speech): Tacotron-style transformer TTS from scratch; 512-dim phoneme encoder and mel-spectrogram decoder at 16 kHz on GigaSpeech.
- VAE (Generative Models): VAE on CelebA (128×128); 4-layer conv encoder, 32-D latent, ConvTranspose decoder. Reconstruction + KL loss over 200 epochs (loss sketch after this list).
- WGANs (Generative Models): Wasserstein GAN and WGAN-GP implemented from scratch on MNIST, using a gradient penalty for stable training (penalty sketch after this list).
- Kimi-K2 (Language Models): DeepSeekV3-inspired MoE with latent attention, trained with the Muon optimizer. Pre-trained weights on HuggingFace.
- CGANs (Generative Models): Conditional GAN on MNIST; class-conditioned 64×64 digit generation. 30 epochs, BCE loss, TensorBoard logging.
- CLAP (Audio/Speech): Contrastive Language-Audio Pretraining from scratch on GigaSpeech. 768-D text and 2048-D audio embeddings projected into a 1024-D shared space.
- DCGANs (Generative Models): Deep Convolutional GAN trained on CelebA and CIFAR-10. ~7,800 steps (CelebA) and ~11,700 steps (CIFAR-10).
- DeepSeekV3 (Language Models): 16×4 MoE with Multi-head Latent Attention and auxiliary-loss-free load balancing, trained on TinyStories on a Kaggle P100.
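
Technique Sketches

A few of the entries above name a specific loss or mechanism; the sketches below reconstruct those in minimal PyTorch. They are illustrations under stated assumptions, not the repos' actual code, and all names, shapes, and hyperparameter values are placeholders.

For the CycleGANs entry, a sketch of the generator-side objective, assuming least-squares adversarial losses and L1 cycle/identity terms (the weights `lambda_cyc` and `lambda_id` are placeholders):

```python
import torch
import torch.nn.functional as F

def cycle_gan_generator_losses(G, F_gen, D_X, D_Y, real_x, real_y,
                               lambda_cyc=10.0, lambda_id=5.0):
    """Generator-side CycleGAN objective with G: X->Y and F_gen: Y->X."""
    fake_y, fake_x = G(real_x), F_gen(real_y)
    # adversarial (least-squares): each generator tries to fool its discriminator
    adv = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) \
        + F.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # cycle consistency: X -> Y -> X (and Y -> X -> Y) should reconstruct the input
    cyc = F.l1_loss(F_gen(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)
    # identity: a generator fed a target-domain image should change nothing
    idt = F.l1_loss(G(real_y), real_y) + F.l1_loss(F_gen(real_x), real_x)
    return adv + lambda_cyc * cyc + lambda_id * idt

# shape check with stand-in "networks"
net = lambda x: x
disc = lambda x: x.mean(dim=(1, 2, 3), keepdim=True)
x, y = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(cycle_gan_generator_losses(net, net, disc, disc, x, y).item())
```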
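For the Differential Transformer entry, a minimal single-head sketch of differential attention: two independently parameterised softmax attention maps are subtracted so that common-mode noise cancels. In the paper the subtraction weight lambda is learned per head; here it is a fixed placeholder:

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Two softmax attention maps over the same values; their weighted
    difference keeps the signal while cancelling shared noise."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

# toy shapes: batch=2, seq=8, head_dim=16
q1, k1, q2, k2, v = (torch.randn(2, 8, 16) for _ in range(5))
print(differential_attention(q1, k1, q2, k2, v).shape)  # torch.Size([2, 8, 16])
```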
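For the DPO entry, the loss is a logistic loss on the gap between policy-vs-reference log-ratios of the chosen and rejected completions. A minimal sketch, assuming the per-sequence log-probs have already been summed (beta=0.1 is a placeholder, not the run's setting):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * ((log pi_w - log ref_w) - (log pi_l - log ref_l)))."""
    pi_logratios = pi_chosen_logps - pi_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# toy usage on a batch of 4 preference pairs
pw, pl, rw, rl = (torch.randn(4) for _ in range(4))
print(dpo_loss(pw, pl, rw, rl).item())
```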
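For the LoRA entry, a from-scratch wrapper around nn.Linear: the pretrained weight is frozen and a trainable rank-r update B·A, scaled by alpha/r, is added on top. B starts at zero, so the wrapped layer is initially a no-op (r=8 and alpha=16 are placeholders, not the repo's settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update:
    y = base(x) + (alpha / r) * x A^T B^T."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```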
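For the Llama entry, the least obvious listed component is RoPE. A minimal sketch that rotates each (even, odd) channel pair of a query or key tensor by a position- and frequency-dependent angle:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotary position embeddings for x of shape (batch, seq, dim):
    channel pair i at position t is rotated by t * base^(-i / (dim/2))."""
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half) / half)      # (half,)
    angles = torch.arange(t)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # pair up channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                 # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 5, 8)
print(apply_rope(q).shape)  # torch.Size([1, 5, 8])
```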
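For the DDP entry, a minimal torchrun launch sketch with a toy model in place of Llama (assumes CUDA GPUs; torchrun supplies LOCAL_RANK and the rendezvous environment variables):

```python
# launch: torchrun --nproc_per_node=2 ddp_min.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")           # reads torchrun's env vars
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(3):                        # toy training steps
        x = torch.randn(8, 10, device=rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                       # DDP all-reduces gradients here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```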
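For the Llama4 entry, a minimal top-1 router: a linear layer scores the experts per token, each token runs through its best expert only, and the output is scaled by the router probability (toy sizes, nothing like the 32×12M configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Sparse MoE layer with top-1 routing over small MLP experts."""
    def __init__(self, dim=64, n_experts=4, hidden=128):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)
        gate, idx = probs.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                 # route only this expert's tokens
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```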
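For the SigLIP entry, the pairwise sigmoid loss: every image-text pair in the batch is an independent binary problem (+1 on the diagonal, -1 elsewhere), so no batch-wide softmax normalisation is needed. The paper learns the temperature t and bias b; they are fixed placeholders here:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T * t + b                  # (n, n) similarity logits
    labels = 2 * torch.eye(logits.size(0)) - 1    # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

print(siglip_loss(torch.randn(8, 32), torch.randn(8, 32)).item())
```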
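For the VAE entry, the objective and the reparameterization trick: pixel reconstruction plus the closed-form KL divergence from N(mu, sigma^2) to the standard normal prior:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Per-sample reconstruction loss plus closed-form KL to N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)  # 32-D latent, as in the entry
z = reparameterize(mu, logvar)
x = torch.rand(4, 3, 128, 128)
print(vae_loss(x, x.clone(), mu, logvar).item())     # both terms are 0 here
```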
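For the WGANs entry, the gradient penalty: the critic's gradient norm at random interpolates of real and fake samples is pushed toward 1 (gp_weight=10 is the WGAN-GP paper's default):

```python
import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    """WGAN-GP term: ((||grad critic(x_hat)|| - 1)^2) at interpolates x_hat."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(
        outputs=scores, inputs=mixed,
        grad_outputs=torch.ones_like(scores), create_graph=True,
    )[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1) ** 2).mean()

# toy critic on MNIST-shaped inputs
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 1))
real, fake = torch.rand(4, 1, 28, 28), torch.rand(4, 1, 28, 28)
print(gradient_penalty(critic, real, fake).item())
```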