# BERT

## Overview
From-scratch PyTorch replication of BERT. Unlike decoder-only models, BERT conditions on both left and right context via masked language modelling (MLM), making it a strong encoder backbone for classification and retrieval. Based on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019).
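To make the "both left and right context" point concrete, here is a minimal sketch (toy sizes and names, not code from this repo) contrasting the attention mask of a bidirectional encoder, which only blocks padding, with the causal mask of a decoder-only model, which also blocks attention to future positions.

```python
import torch

seq_len, pad_len = 8, 2  # toy sizes; assume the last two positions are padding

# Bidirectional encoder (BERT): only padding is masked out,
# so every real token can attend to every other real token.
padding_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
padding_mask[:, seq_len - pad_len:] = False

# Decoder-only model (e.g. GPT): attention to future positions is blocked as well.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_mask[:, seq_len - pad_len:] = False

print(padding_mask.int())  # full rows of ones over the real tokens
print(causal_mask.int())   # lower-triangular pattern
```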
## Architecture

Full bidirectional transformer encoder: no causal mask is applied, so every token attends to the entire sequence. Trained with the MLM objective, in which 15% of input tokens are selected for masking and the model reconstructs them from the surrounding context.
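For concreteness, the following is a minimal sketch of the MLM corruption step. The 80%/10%/10% split (replace with [MASK] / replace with a random token / keep unchanged) follows the original paper; the function name, argument names, and special-token handling are illustrative assumptions, not code from this repository.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    """Select ~15% of tokens; replace 80% of them with [MASK], 10% with a random
    token, and leave 10% unchanged, per Devlin et al. (2019)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Candidate positions, never selecting special tokens ([CLS], [SEP], [PAD]).
    prob = torch.full(labels.shape, mlm_prob)
    prob.masked_fill_(torch.isin(input_ids, torch.tensor(list(special_ids))), 0.0)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100  # unselected positions are ignored by the loss

    # 80% of selected positions become [MASK].
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[to_mask] = mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) become a random token.
    to_rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~to_mask
    input_ids[to_rand] = torch.randint(vocab_size, labels.shape)[to_rand]

    # The final 10% keep their original token and must still be predicted.
    return input_ids, labels
```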
## Training
- Dataset: Cornell Movie Dialog Corpus
- Objective: Masked Language Modelling (MLM); a minimal training-step sketch follows this list
- Framework: PyTorch
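Putting the pieces above together, one MLM training step might look like the sketch below. `model`, `optimizer`, and `mask_tokens` are assumed names (the corruption helper is the one sketched in the Architecture section), and `model` is assumed to return per-token vocabulary logits. The loss uses `ignore_index=-100` so that only the corrupted positions contribute.

```python
import torch.nn.functional as F

def train_step(model, optimizer, batch_ids, mask_token_id, vocab_size, special_ids):
    # Corrupt the batch with the MLM scheme sketched above.
    input_ids, labels = mask_tokens(batch_ids, mask_token_id, vocab_size, special_ids)

    logits = model(input_ids)                # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),    # flatten to (batch * seq_len, vocab_size)
        labels.view(-1),                     # flatten to (batch * seq_len,)
        ignore_index=-100,                   # only selected (corrupted) positions count
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```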
## Paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al., 2019