SimplePO

Fine-tuning · PyTorch · UltraFeedback
GitHub →

Overview

From-scratch implementation of SimplePO applied to OPT-350M. SimplePO is a reference-free preference optimisation method: it treats the length-normalised average log-probability of a response as an implicit reward and maximises the reward margin between chosen and rejected responses, with no reference model, explicit reward model, or KL penalty. Based on SimPO: Simple Preference Optimization with a Reference-Free Reward (Meng et al., 2024).
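
For reference, the objective from Meng et al. (2024) that this project implements, where σ is the logistic function, β scales the implicit reward, and γ is the target reward margin:

```math
\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) \;-\; \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) \;-\; \gamma \right) \right]
```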

Setup

  • Base model: OPT-350M
  • Dataset: UltraFeedback binarized preferences
  • Loss: Bradley-Terry loss on length-normalised log-probability rewards with target margin γ (see the sketch below)
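
A minimal PyTorch sketch of this loss, assuming summed response log-probabilities and token counts are computed elsewhere; the function and argument names are illustrative, not the repo's actual API:

```python
import torch.nn.functional as F

def simplepo_loss(chosen_logps, rejected_logps,
                  chosen_lengths, rejected_lengths,
                  beta=2.0, gamma=1.6):
    """SimPO-style loss: length-normalised implicit rewards with a target margin.

    chosen_logps / rejected_logps: summed token log-probs of each response
    under the policy; *_lengths: response token counts.
    """
    # Implicit reward = beta * average per-token log-probability
    chosen_rewards = beta * chosen_logps / chosen_lengths
    rejected_rewards = beta * rejected_logps / rejected_lengths
    # Bradley-Terry objective with target reward margin gamma
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```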

Training

Hyperparameter          Value
----------------------  ---------------
Batch size              128
Optimizer               Adam, lr = 2e-5
Beta (reward scaling)   2.0
Gamma (margin)          1.6
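
A hedged sketch of how these values might wire into a training step, reusing the simplepo_loss sketch above. The facebook/opt-350m checkpoint name, the batch layout, and the sequence_logps helper are assumptions for illustration, not the repo's actual code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Checkpoint name is an assumption; the repo's loading code may differ.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def sequence_logps(input_ids, attention_mask, labels):
    """Summed log-prob of the response tokens, plus response length.

    `labels` mirrors `input_ids` with prompt/padding positions set to -100.
    """
    logits = model(input_ids, attention_mask=attention_mask).logits[:, :-1]
    targets = labels[:, 1:]
    mask = targets != -100
    token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), 2,
        targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1), mask.sum(-1)

def train_step(batch):
    # batch["chosen"] / batch["rejected"] are assumed to be
    # (input_ids, attention_mask, labels) tuples for each response.
    chosen_logps, chosen_len = sequence_logps(*batch["chosen"])
    rejected_logps, rejected_len = sequence_logps(*batch["rejected"])
    loss = simplepo_loss(chosen_logps, rejected_logps,
                         chosen_len, rejected_len, beta=2.0, gamma=1.6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```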

Paper

SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al., 2024