# DPO

## Overview
A from-scratch implementation of Direct Preference Optimization (DPO), applied to Qwen0.5B-Instruct. DPO removes the need for a separate reward model by optimizing the policy directly on preference pairs, using a closed-form loss derived from the RLHF objective. Based on *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (Rafailov et al., 2023).
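For reference, the DPO objective from the paper, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, and $(x, y_w, y_l)$ a prompt with its chosen and rejected completions:

```math
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```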
## Setup
- Base model: Qwen0.5B-Instruct
- Dataset: UltraFeedback binarized (chosen/rejected preference pairs from the HuggingFace Hub)
- Loss: DPO contrastive loss; β scales the implicit KL penalty against a frozen reference model (see the loss sketch below)
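A minimal sketch of that loss, assuming per-sequence log-probabilities for each pair have already been computed. The function and argument names are illustrative, not the repo's API, and `beta=0.1` is an assumed default since the README does not state the value used:

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of a
    completion under the policy or the frozen reference model. beta scales
    the implicit KL penalty (0.1 is an assumed default here).
    """
    # Policy-to-reference log-ratios for chosen and rejected completions.
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    # -log sigmoid(beta * margin), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

The binarized UltraFeedback pairs commonly come from the `HuggingFaceH4/ultrafeedback_binarized` dataset on the Hub; the exact split used here is not stated.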
## Training
| Hyperparameter | Value |
|---|---|
| Iterations | 3,000 |
| Optimizer | Adam, lr=1e-6 |
| Batch size | 2 |
| Val frequency | Every 20 steps |
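A hypothetical end-to-end training loop wiring these hyperparameters together, building on the `dpo_loss` sketch above. The checkpoint name, batch format, `train_loader`, and `validate` are illustrative assumptions rather than the repo's actual interfaces:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint name; the README says only "Qwen0.5B-Instruct".
MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"

policy = AutoModelForCausalLM.from_pretrained(MODEL_ID)
reference = AutoModelForCausalLM.from_pretrained(MODEL_ID)
reference.eval()
for p in reference.parameters():  # the reference model stays frozen
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-6)

def sequence_logps(model, input_ids, attention_mask, labels):
    """Summed log-probs of completion tokens (prompt positions labeled -100)."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logits, labels = logits[:, :-1, :], labels[:, 1:]  # next-token alignment
    mask = labels != -100
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = torch.gather(
        logps, 2, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

# `train_loader` yields batches of 2 preference pairs, each side packed as
# (input_ids, attention_mask, labels); `validate` is a hypothetical helper.
for step, batch in enumerate(train_loader, start=1):
    with torch.no_grad():  # no gradients through the reference model
        ref_chosen = sequence_logps(reference, *batch["chosen"])
        ref_rejected = sequence_logps(reference, *batch["rejected"])
    loss = dpo_loss(
        sequence_logps(policy, *batch["chosen"]),
        sequence_logps(policy, *batch["rejected"]),
        ref_chosen, ref_rejected,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 20 == 0:  # validation every 20 steps, per the table
        validate(policy)
    if step == 3000:  # 3,000 iterations
        break
```

Only completion tokens contribute to each sequence's log-probability; prompt positions are masked with the usual `-100` label convention so the preference margin is measured on responses alone.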
## Results
| Split | Loss |
|---|---|
| Train | 0.67 |
| Validation | 0.68 |

For context, a zero preference margin gives a DPO loss of log 2 ≈ 0.693, so these values correspond to a small positive margin of chosen over rejected completions.
## Paper
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. (2023). *Direct Preference Optimization: Your Language Model is Secretly a Reward Model*. Stanford University. arXiv:2305.18290.