# ORPO

## Overview
From-scratch implementation of ORPO (Odds Ratio Preference Optimization), applied to OPT-330M for instruction following. ORPO unifies SFT and alignment into a single stage by penalising rejected responses via an odds-ratio term added to the NLL loss — no reference model required. Based on ORPO: Monolithic Preference Optimization without Reference Model (Hong et al., 2024).
## Setup
- Base model: OPT-330M
- Dataset: UltraFeedback binarized (Argilla's cleaned version)
- Loss: NLL + log odds-ratio penalty on rejected responses
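The loss above combines the standard SFT objective with an odds-ratio penalty, where odds(y) = P(y|x) / (1 - P(y|x)). A minimal PyTorch sketch of this term (function name, the `beta` weight, and its value are illustrative assumptions, not taken from this repo):

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, nll_loss, beta=0.1):
    """Sketch of the ORPO objective: NLL on chosen responses plus an
    odds-ratio penalty on rejected ones. `chosen_logps` / `rejected_logps`
    are length-normalized log P(y|x) per example; `beta` (lambda in the
    paper) weights the penalty and is an assumed value here."""
    # log odds(y) = log p - log(1 - p), computed stably via log1p(-exp(log p))
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # log of the odds ratio between chosen and rejected responses
    log_odds_ratio = log_odds_chosen - log_odds_rejected
    # penalty shrinks as the model prefers chosen over rejected
    or_loss = -F.logsigmoid(log_odds_ratio).mean()
    return nll_loss + beta * or_loss
```

Because both terms come from the policy itself, no frozen reference model is needed, unlike DPO.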
## Training
| Hyperparameter | Value |
|---|---|
| Iterations | 3,000 |
| Optimizer | Adam, lr=8e-6, betas=(0.95, 0.99) |
| Weight decay | 0.1 |
| Batch size | 2 |
| Val frequency | Every 20 steps |
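The optimizer settings in the table map directly onto `torch.optim.Adam`; a minimal sketch (the small `Linear` layer stands in for OPT-330M, which is not loaded here):

```python
import torch
import torch.nn as nn

# Placeholder module; in the actual run this would be the OPT-330M model.
model = nn.Linear(4, 4)

# Optimizer configured with the hyperparameters from the table above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=8e-6,
    betas=(0.95, 0.99),
    weight_decay=0.1,
)
```

Note that plain `Adam` applies weight decay as an L2 term on the gradients, which differs from the decoupled decay of `AdamW`.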
## Results
| Split | Loss (at 2,500 of 3,000 steps) |
|---|---|
| Train | 1.70 |
| Validation | 1.98 |
## Paper
ORPO: Monolithic Preference Optimization without Reference Model — Hong et al., 2024