Llama
Overview
A from-scratch PyTorch replication of the Llama architecture. Llama improves on the vanilla GPT-style transformer by replacing LayerNorm with RMSNorm, using SwiGLU activations in the feed-forward sublayers, and adopting Rotary Positional Embeddings (RoPE); together these changes improve training stability and efficiency. Based on LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023).
Architecture
- Norm: RMSNorm, applied pre-norm (sketch below)
- Activations: SwiGLU feed-forward sublayers (sketch below)
- Position: Rotary Positional Embeddings (RoPE) (sketch below)
- Attention: Grouped-Query Attention (GQA) (sketch below)
- Decoder-only autoregressive stack
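The sketches below are minimal PyTorch illustrations of these components under stated assumptions, not this repo's actual modules; class names, shapes, and hyperparameters (e.g. `eps`) are assumptions. RMSNorm rescales each token's features by their root mean square with a learned gain, with no mean-centering and no bias:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                                # assumed epsilon, not taken from this repo
        self.weight = nn.Parameter(torch.ones(dim))   # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); normalize over the feature dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```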
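A SwiGLU feed-forward sublayer gates an up-projection with a SiLU-activated projection before projecting back down. The LLaMA paper sizes the hidden dimension at roughly two-thirds of 4x the model dimension; the sketch leaves `hidden_dim` as a free parameter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward: w2( silu(w1(x)) * w3(x) )."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```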
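RoPE encodes position by rotating each adjacent pair of query/key features by an angle proportional to the token's position. A common complex-number formulation (base 10000 and the `(batch, seq, n_heads, head_dim)` layout are assumptions) looks like:

```python
import torch

def precompute_rope_freqs(head_dim: int, max_seq_len: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotation factors exp(i * m * theta_j) for every position m and frequency index j."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))  # (head_dim/2,)
    positions = torch.arange(max_seq_len).float()                              # (max_seq_len,)
    angles = torch.outer(positions, theta)                                     # (max_seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)                        # complex64

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate queries or keys. x: (batch, seq, n_heads, head_dim)."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))  # pair up features
    freqs = freqs[: x.shape[1]].unsqueeze(0).unsqueeze(2)                       # (1, seq, 1, head_dim/2)
    x_rotated = torch.view_as_real(x_complex * freqs).flatten(-2)               # back to real pairs
    return x_rotated.type_as(x)
```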
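Grouped-Query Attention keeps the full set of query heads but shares a smaller set of key/value heads across groups of queries, shrinking the K/V projections (and any KV cache). A minimal causal version, with RoPE application omitted for brevity and all sizes assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Causal self-attention where groups of query heads share one key/value head."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every query head in a group attends to its shared K/V.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask built in
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))
```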
Training
- Dataset: TinyShakespeare
- Objective: Causal language modelling (see the training-loop sketch after this list)
- Framework: PyTorch
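A minimal sketch of the causal language-modelling objective on a character-level TinyShakespeare stream. `text` (the raw corpus string), `model` (a Llama-style decoder returning per-token logits), and all hyperparameters are assumptions, not this repo's actual training script:

```python
import torch
import torch.nn.functional as F

# Assumed: `text` holds the TinyShakespeare corpus, `model` maps (batch, seq) token ids
# to (batch, seq, vocab) logits.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

block_size, batch_size = 256, 32
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch():
    # Random windows; the target is the input shifted one token to the left.
    starts = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in starts])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in starts])
    return x, y

for step in range(1000):
    x, y = get_batch()
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```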
Paper
LLaMA: Open and Efficient Foundation Language Models — Touvron et al., 2023