# SimplePO

## Overview
A from-scratch implementation of SimplePO applied to OPT-350M. SimplePO is a reference-free preference optimisation method: it directly maximises the margin between the length-normalised log-likelihoods of chosen and rejected responses, with no KL penalty, reward model, or reference model. Based on *SimPO: Simple Preference Optimization with a Reference-Free Reward* (Meng et al., 2024).
## Setup
- Base model: OPT-350M
- Dataset: UltraFeedback binarized preferences
- Loss: length-normalised log-likelihood reward with a target margin γ between chosen and rejected responses
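The loss above can be sketched for a single preference pair as follows. This is a minimal illustration, not the repository's actual code: the function name `simpo_loss` is made up here, and the defaults mirror the β and γ values from the training table below.

```python
import math

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.6):
    """SimPO-style loss for one preference pair (illustrative sketch).

    chosen_logps / rejected_logps: per-token log-probabilities of the
    chosen and rejected responses under the current policy.
    """
    # Length-normalised sequence log-likelihood acts as the implicit reward,
    # so longer responses are not favoured just for having more tokens.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    # Bradley-Terry objective with target margin gamma:
    #   loss = -log sigmoid(r_chosen - r_rejected - gamma)
    margin = r_chosen - r_rejected - gamma
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is near zero once the chosen response beats the rejected one by more than γ in scaled average log-likelihood, and grows roughly linearly as the rejected response overtakes the chosen one.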
## Training
| Hyperparameter | Value |
|---|---|
| Batch size | 128 |
| Optimizer | Adam, lr=2e-5 |
| β (reward scaling) | 2.0 |
| γ (target margin) | 1.6 |
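For reference, the table above can be collected into a single config mapping. This is purely illustrative; the key names are assumptions, not taken from the repository.

```python
# Illustrative training config mirroring the table above.
# Key names are assumptions, not the repository's actual config schema.
TRAIN_CONFIG = {
    "batch_size": 128,
    "optimizer": "adam",
    "learning_rate": 2e-5,
    "beta": 2.0,    # reward scaling
    "gamma": 1.6,   # target margin
}
```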
## Paper

*SimPO: Simple Preference Optimization with a Reference-Free Reward*, Meng et al., 2024