Paper Replications

  • ViT Computer Vision
    ViT-B/16 from scratch on a 3-class Food-101 subset. Train loss 1.20 / test loss 1.52.
  • GPT Language Models
    Decoder-only transformer trained on TinyShakespeare, replicating the original OpenAI GPT architecture from scratch.
  • BERT Language Models
    Bidirectional encoder pre-trained with masked language modelling on the Cornell Movie Dialogs corpus.
  • CycleGANs Generative Models
    Cycle-consistent unpaired image translation on Cityscapes — two generators, two discriminators, cycle + identity losses.
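    A minimal sketch of the cycle and identity terms, assuming generators G_xy (X→Y) and F_yx (Y→X); the names and λ weights are illustrative, not the repo's exact code:
    ```python
    import torch.nn.functional as F

    def cycle_gan_generator_losses(G_xy, F_yx, real_x, real_y,
                                   lam_cycle=10.0, lam_id=5.0):
        fake_y, fake_x = G_xy(real_x), F_yx(real_y)
        # Cycle consistency: translating there and back should reconstruct the input
        cycle = F.l1_loss(F_yx(fake_y), real_x) + F.l1_loss(G_xy(fake_x), real_y)
        # Identity: a generator fed an image already in its target domain
        # should leave it (roughly) unchanged
        identity = F.l1_loss(G_xy(real_y), real_y) + F.l1_loss(F_yx(real_x), real_x)
        return lam_cycle * cycle + lam_id * identity
    ```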
  • Differential Transformer Language Models
    Differential attention replicated from scratch — two attention maps subtracted to cancel noise. Trained on TinyShakespeare on A100.
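    A minimal single-head sketch of the subtraction, assuming a shared V projection and a scalar learnable λ (the paper uses a per-head reparameterised λ plus group norm):
    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DiffAttention(nn.Module):
        """Differential attention: subtract two softmax maps to cancel noise."""
        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, 2 * dim, bias=False)   # two query projections
            self.k = nn.Linear(dim, 2 * dim, bias=False)   # two key projections
            self.v = nn.Linear(dim, dim, bias=False)       # shared value projection
            self.lam = nn.Parameter(torch.tensor(0.5))     # learnable subtraction weight

        def forward(self, x):                              # x: (batch, seq, dim)
            d = x.size(-1)
            q1, q2 = self.q(x).chunk(2, dim=-1)
            k1, k2 = self.k(x).chunk(2, dim=-1)
            a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
            a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
            return (a1 - self.lam * a2) @ self.v(x)        # causal mask omitted
    ```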
  • Encoder-Decoder Sequential Models
    LSTM-based Seq2Seq encoder-decoder for German→English translation. Train/val loss ~1.38 in 10 epochs.
  • QLoRA Fine-tuning
    QLoRA fine-tuning scripts using PEFT + BitsAndBytes for both decoder- and encoder-type models; a sketch follows below.
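    A minimal sketch of the setup, assuming a 4-bit NF4 base model via BitsAndBytes with LoRA adapters from PEFT (the model name and hyperparameters are placeholders):
    ```python
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # base weights quantised to 4-bit NF4
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-350m", quantization_config=bnb_config)   # placeholder model
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()          # only the LoRA adapters train
    ```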
  • GRU Sequential Models
    GRU from scratch. 16 hidden units, 50 epochs. Train loss 0.51 / val loss 0.48.
  • Attention Mechanisms Sequential Models
    From-scratch PyTorch implementations of Bahdanau (additive) and Luong (multiplicative) attention; see the sketch below.
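    A minimal sketch of the Bahdanau score, score(s, h) = vᵀ tanh(W_s s + W_h h); Luong's multiplicative variant replaces the additive score with a plain dot product:
    ```python
    import torch
    import torch.nn as nn

    class BahdanauAttention(nn.Module):
        def __init__(self, hidden: int):
            super().__init__()
            self.W_s = nn.Linear(hidden, hidden, bias=False)  # decoder state
            self.W_h = nn.Linear(hidden, hidden, bias=False)  # encoder outputs
            self.v = nn.Linear(hidden, 1, bias=False)

        def forward(self, dec_state, enc_outputs):
            # dec_state: (batch, hidden); enc_outputs: (batch, src_len, hidden)
            scores = self.v(torch.tanh(
                self.W_s(dec_state).unsqueeze(1) + self.W_h(enc_outputs)))
            weights = scores.softmax(dim=1)               # over source positions
            context = (weights * enc_outputs).sum(dim=1)  # weighted sum of encoder states
            return context, weights.squeeze(-1)
    ```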
  • RNNs Sequential Models
    Vanilla RNN from scratch. 16 neurons, 50 epochs. Train loss 0.51 / val loss 0.50.
  • Transformer Language Models
    Encoder-decoder transformer for English→Hindi translation on Samanantar (~25M params). Published on HuggingFace.
  • Mixtral Language Models
    Sparse MoE transformer replicated from scratch on TinyShakespeare. Train loss 2.04 / val loss 2.09 in 1,000 steps on T4....
  • DPO Fine-tuning
    Direct Preference Optimization applied to Qwen0.5B-Instruct on UltraFeedback. Train loss 0.67 in 3,000 iterations.
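    A minimal sketch of the DPO objective, assuming per-sequence log-probs already summed over tokens (names are illustrative):
    ```python
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """Push the policy's chosen/rejected log-ratio above the reference model's."""
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
    ```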
  • SimplePO Fine-tuning
    Reference-free preference optimization (SimplePO) on OPT-330M. Batch size 128, lr=2e-5, beta=2 on UltraFeedback.
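    A minimal sketch of a reference-free, length-normalised preference loss in this style; the repo's exact formulation may differ:
    ```python
    import torch.nn.functional as F

    def simple_po_loss(chosen_logp, rejected_logp, chosen_len, rejected_len,
                       beta=2.0, gamma=1.0):
        """No reference model: compare length-normalised log-probs, with a
        target margin gamma between chosen and rejected responses."""
        margin = beta * (chosen_logp / chosen_len - rejected_logp / rejected_len)
        return -F.logsigmoid(margin - gamma).mean()
    ```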
  • LoRA Fine-tuning
    Low-rank adaptation implemented from scratch in PyTorch. Train/val loss ~3.5 in 1,000 steps on A100.
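    A minimal sketch of the idea: freeze the pretrained weight and learn a low-rank update B·A scaled by α/r (B starts at zero, so training begins as a no-op):
    ```python
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)        # freeze pretrained weight
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
    ```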
  • ORPO Fine-tuning
    Odds Ratio Preference Optimization on OPT-330M. Reference-free alignment reaching train loss 1.70 in 3,000 iterations.
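    A minimal sketch of the odds-ratio penalty ORPO adds to the usual NLL term on the chosen response, assuming per-token average log-probs as inputs:
    ```python
    import torch
    import torch.nn.functional as F

    def orpo_odds_ratio_loss(chosen_avg_logp, rejected_avg_logp):
        """log_odds(p) = log(p / (1 - p)); the chosen response's log-odds
        should beat the rejected one's, with no reference model involved."""
        def log_odds(avg_logp):
            return avg_logp - torch.log1p(-torch.exp(avg_logp))
        return -F.logsigmoid(log_odds(chosen_avg_logp)
                             - log_odds(rejected_avg_logp)).mean()
    ```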
  • Gemma Language Models
    Google's Gemma architecture replicated from scratch — multi-query attention and GeGLU activations on TinyShakespeare.
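    A minimal sketch of the GeGLU feed-forward block, one of the listed components (dimensions are illustrative):
    ```python
    import torch.nn as nn
    import torch.nn.functional as F

    class GeGLU(nn.Module):
        """Gated GELU feed-forward: GELU(x W_gate) * (x W_up), projected back down."""
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.gate = nn.Linear(dim, hidden, bias=False)
            self.up = nn.Linear(dim, hidden, bias=False)
            self.down = nn.Linear(hidden, dim, bias=False)

        def forward(self, x):
            return self.down(F.gelu(self.gate(x)) * self.up(x))
    ```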
  • Llama Language Models
    Decoder-only Llama replicated from scratch with RoPE, SwiGLU, RMSNorm and GQA.
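    A minimal sketch of one of the listed pieces, RMSNorm, which rescales by the root-mean-square without subtracting the mean:
    ```python
    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(dim))   # learnable per-channel gain
            self.eps = eps

        def forward(self, x):
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return self.weight * x * rms
    ```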
  • CLIP Vision-Language
    Contrastive vision-language model trained on Flickr8K. Train loss 1.3 / val loss 2.2 in 30 epochs on T4.
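    A minimal sketch of the symmetric contrastive (InfoNCE) objective, assuming L2-normalised embeddings and a fixed temperature:
    ```python
    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        """Matched image-text pairs sit on the diagonal of the logit matrix."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.T / temperature     # (batch, batch) similarities
        targets = torch.arange(len(logits), device=logits.device)
        return (F.cross_entropy(logits, targets) +        # image -> text
                F.cross_entropy(logits.T, targets)) / 2   # text -> image
    ```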
  • DDP Training Methods
    Llama trained with PyTorch DistributedDataParallel (torchrun). Val loss 1.1 in 8,000 iterations on TinyShakespeare.
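    A minimal sketch of the torchrun + DistributedDataParallel setup (the helper name is illustrative):
    ```python
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(model):
        # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        # Gradients are all-reduced across processes during backward()
        return DDP(model.cuda(local_rank), device_ids=[local_rank])

    # launch: torchrun --nproc_per_node=4 train.py
    ```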
  • LLaVA Vision-Language
    Visual instruction tuning replicated from scratch on Flickr8K. Train loss 0.23 / val loss 0.22 in 5 epochs on T4....
  • Seq2Seq Sequential Models
    GRU-based Seq2Seq with both Bahdanau and Luong attention from scratch. 128 hidden units, 50 epochs.
  • Whisper Audio/Speech
    Whisper ASR from scratch — CNN on 80-channel mel spectrograms + 6-layer transformer decoder. Trained on GigaSpeech.
  • LSTM Sequential Models
    LSTM from scratch (~128K params). 128 hidden units, 50 epochs. Train loss 0.49 / val loss 0.48.
  • Gemma3 Language Models
    90M-parameter Gemma 3 with local sliding-window attention (128-token blocks). Val loss 1.77 in 25k steps on TinyStories.
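    A minimal sketch of a sliding-window causal mask, assuming each position attends to the previous 128 tokens:
    ```python
    import torch

    def sliding_window_causal_mask(seq_len: int, window: int = 128):
        """Causal mask restricted to a local window: position i sees
        only positions in [i - window + 1, i]."""
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        return (j <= i) & (j > i - window)       # bool mask, True = may attend

    # usage: scores.masked_fill_(~sliding_window_causal_mask(T, 128), float("-inf"))
    ```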
  • Llama4 Language Models
    1.2B-parameter MoE (32×12M experts, top-1 routing) trained on TinyStories. Val loss 1.70 in 20k steps on Kaggle P100.
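    A minimal sketch of top-1 routing, assuming a linear router and small MLP experts (sizes are illustrative, not the 12M experts used here):
    ```python
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        """Each token is dispatched to the single highest-scoring expert."""
        def __init__(self, dim: int, n_experts: int = 32, hidden: int = 64):
            super().__init__()
            self.router = nn.Linear(dim, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(n_experts))

        def forward(self, x):                        # x: (tokens, dim)
            probs = self.router(x).softmax(dim=-1)
            gate, idx = probs.max(dim=-1)            # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                sel = idx == e
                if sel.any():                        # scale output by the gate prob
                    out[sel] = gate[sel].unsqueeze(-1) * expert(x[sel])
            return out
    ```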
  • Moonshine Audio/Speech
    Compact transformer ASR (288-dim, 6 heads) trained on GigaSpeech for 1,500 steps. Notes on overfitting at ~25 hours.
  • PaliGemma Vision-Language
    Google's PaliGemma VLM (SigLIP + Gemma) replicated from scratch on Flickr8K.
  • Pix2Pix Generative Models
    Conditional GAN for paired image-to-image translation (aerial→map) replicated from scratch. PatchGAN discriminator.
  • SigLIP Vision-Language
    Sigmoid-loss vision-language pretraining replicated from scratch on Flickr8K — avoids global softmax normalisation.
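    A minimal sketch of the sigmoid objective: every image-text pair becomes an independent binary label (+1 on the diagonal, −1 off it), so no batch-wide softmax is needed:
    ```python
    import torch
    import torch.nn.functional as F

    def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
        """t and b are learnable scalars in the paper; fixed here for brevity."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.T * t + b
        labels = 2 * torch.eye(len(logits), device=logits.device) - 1   # +1 / -1
        return -F.logsigmoid(labels * logits).mean()
    ```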
  • TTS Audio/Speech
    Tacotron-style transformer TTS from scratch — 512-dim phoneme encoder, mel spectrogram decoder, 16kHz on GigaSpeech.
  • VAE Generative Models
    VAE on CelebA (128×128). 4-layer conv encoder, 32D latent, ConvTranspose decoder. Reconstruction + KL loss over 200 epochs.
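    A minimal sketch of the objective and the reparameterisation trick, assuming a Gaussian posterior parameterised by (mu, logvar) heads:
    ```python
    import torch
    import torch.nn.functional as F

    def vae_loss(recon, x, mu, logvar, beta=1.0):
        """Reconstruction term plus KL divergence to the unit Gaussian prior."""
        recon_loss = F.mse_loss(recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + beta * kl

    def reparameterize(mu, logvar):
        # z = mu + sigma * eps keeps sampling differentiable
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    ```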
  • WGANs Generative Models
    Wasserstein GAN and WGAN-GP implemented from scratch on MNIST — gradient penalty for stable training.
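    A minimal sketch of the gradient penalty, assuming 4-D image batches and a critic that maps images to scalar scores:
    ```python
    import torch

    def gradient_penalty(critic, real, fake):
        """Penalise the critic's gradient norm away from 1 on random interpolates."""
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
        scores = critic(interp)
        grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                    grad_outputs=torch.ones_like(scores),
                                    create_graph=True)[0]
        return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    ```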
  • Kimi-K2 Language Models
    DeepSeekV3-inspired MoE with latent attention trained with Muon optimizer. Pre-trained weights on HuggingFace.
  • CGANs Generative Models
    Conditional GAN on MNIST — class-conditioned 64×64 digit generation. 30 epochs, BCE loss, TensorBoard logging.
  • CLAP Audio/Speech
    Contrastive Language-Audio Pretraining from scratch on GigaSpeech. 768D text / 2048D audio → 1024D shared space.
  • DCGANs Generative Models
    Deep Convolutional GAN trained on CelebA and CIFAR-10. ~7,800 steps (CelebA) and ~11,700 steps (CIFAR-10).
  • DeepSeekV3 Language Models
    16×4 MoE with Multi-head Latent Attention and auxiliary-free load balancing, trained on TinyStories on Kaggle P100.