ORPO


Overview

From-scratch implementation of ORPO (Odds Ratio Preference Optimization), applied to OPT-330M for instruction following. ORPO unifies SFT and alignment into a single stage by penalising rejected responses via an odds-ratio term added to the NLL loss — no reference model required. Based on ORPO: Monolithic Preference Optimization without Reference Model (Hong et al., 2024).

Setup

  • Base model: OPT-330M
  • Dataset: UltraFeedback (Argilla's cleaned, binarized version)
  • Loss: NLL + log odds-ratio penalty on rejected responses
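The loss above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's code: it assumes `chosen_logps` and `rejected_logps` are mean per-token log-probabilities of the chosen and rejected responses, and `beta` (the paper's λ) weights the odds-ratio penalty.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, nll_loss, beta=0.1):
    """ORPO objective sketch: SFT NLL plus a log odds-ratio penalty.

    chosen_logps / rejected_logps: mean per-token log-probs, shape (batch,).
    nll_loss: standard NLL on the chosen responses.
    beta: penalty weight (illustrative default; the paper calls it lambda).
    """
    # log odds(y|x) = log(p / (1 - p)) = log p - log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # log sigmoid of the log odds-ratio; maximized when chosen >> rejected
    log_odds_ratio = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # logsigmoid is negative, so subtracting it adds a positive penalty
    return nll_loss - beta * log_odds_ratio.mean()
```

Because the penalty depends only on the policy's own probabilities, no frozen reference model is needed, which is what lets ORPO collapse SFT and alignment into one stage.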

Training

  • Iterations: 3,000
  • Optimizer: Adam (lr = 8e-6, betas = (0.95, 0.99))
  • Weight decay: 0.1
  • Batch size: 2
  • Validation frequency: every 20 steps

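The hyperparameters above translate directly into an optimizer configuration. A sketch (the `nn.Linear` stand-in replaces the actual OPT-330M model):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in; the actual run optimizes OPT-330M

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=8e-6,
    betas=(0.95, 0.99),
    weight_decay=0.1,
)
```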
Results

  • Train loss (at 2.5k steps): 1.70
  • Validation loss (at 2.5k steps): 1.98

Paper

ORPO: Monolithic Preference Optimization without Reference Model — Hong et al., 2024