Find our paper at https://arxiv.org/abs/2507.15507 and visit our poster at COLM 2025!
The experiment is split into multiple files, which are integrated in the `ocrm_full_1B_tldr.sh` script.
The steps are:

- `ocrm/sft.py`: Train a supervised fine-tuning (SFT) model on the reward model dataset.
- `ocrm/sample_sftmodel.py`: Sample completions for the RM dataset training.
- `ocrm/gold_label_dataset.py`: Add synthetic preference labels to the RM dataset.
- `ocrm/reward.py`: Train a reward model on the RM dataset.
- `ocrm/ppo.py`: Train a PPO model on the reward model dataset.
- `ocrm/get_density_ratios.py`: Compute the density ratios for importance weighting. (A)
- `ocrm/reward.py`: Train an off-policy corrected reward model (OCRM) on the RM dataset (a sketch of the weighting idea follows this list).
- `ocrm/ppo.py`: Train a PPO model on the OCRM dataset.
- goto (A)
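To make the correction step concrete, here is a minimal sketch of an importance-weighted Bradley-Terry reward-model loss, which is the general idea behind reweighting the RM training with density ratios. The function, argument names, and the clipping value are illustrative assumptions made for this sketch; the exact definition, clipping, and normalization of the density ratios follow the paper and are implemented in `ocrm/reward.py` and `ocrm/get_density_ratios.py`, not in this snippet.

```python
# Illustrative sketch only (not the repo's code): an importance-weighted
# Bradley-Terry preference loss. All names and the clipping value are
# hypothetical; see ocrm/reward.py for the actual implementation.
import torch
import torch.nn.functional as F

def weighted_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor,
                             density_ratio: torch.Tensor,
                             clip: float = 10.0) -> torch.Tensor:
    """Bradley-Terry loss on preference pairs, reweighted by density ratios.

    reward_chosen, reward_rejected: (batch,) scalar RM outputs for each pair.
    density_ratio: (batch,) estimated ratio between the current policy and the
        policy that generated the RM data, e.g. as produced by
        ocrm/get_density_ratios.py (its exact definition follows the paper).
    """
    weights = density_ratio.clamp(max=clip)  # clamp for stability (assumption)
    nll = -F.logsigmoid(reward_chosen - reward_rejected)
    return (weights * nll).mean()
```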
We also provide an implementation of WPO in `ocrm/wpo.py`, based on Huang et al.'s DPO implementation.
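For orientation, the standard DPO objective that such implementations start from looks roughly as follows. This is a generic sketch with made-up names, not the code in `ocrm/wpo.py` or `ocrm/dpo.py`.

```python
# Generic sketch of the standard DPO objective (Rafailov et al., 2023);
# not this repo's code. Inputs are sequence log-probabilities of the
# chosen/rejected completions under the trained policy and a frozen
# reference policy.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```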
Docker is not strictly necessary, but we recommend using it for ease of reproducibility.
When using Docker, you can do the following:
```bash
git clone git@github.com:JohannesAck/OffPolicyCorrectedRewardModeling.git
cd OffPolicyCorrectedRewardModeling
docker build -f .devcontainer/Dockerfile -t my-ocrm-image .
docker run --gpus all --shm-size=32g -it --rm -v $(pwd):/app my-ocrm-image
cd /app
bash ocrm_full_DEBUG.sh
```
Alternatively, you can use poetry to install the dependencies. In that case, replace `/root/.local/bin/poetry run` with `poetry run` in the `.sh` scripts:
```bash
git clone git@github.com:JohannesAck/OffPolicyCorrectedRewardModeling.git
cd OffPolicyCorrectedRewardModeling
poetry install
poetry run pip install --upgrade setuptools packaging wheel pip ninja
poetry run pip install deepspeed==0.14.4
poetry run pip install --no-build-isolation --no-deps flash-attn
bash ocrm_full_DEBUG.sh
```
To check that your installation works, use `ocrm_full_DEBUG.sh`.
This script trains a 14M model for only a few training steps to make sure the whole process works; it runs on a 16 GB GPU in about 10 minutes.
To reproduce the Pythia-1B TL;DR experiments, use the script `ocrm_full_1B_tldr.sh`.
To reproduce the Qwen 2.5 1.5B Alpaca experiments, use the script `ocrm_full_1_5B_alpaca.sh`.
The above scripts also generate the SFT and RM datasets, which can be used to reproduce our WPO and DPO results with the scripts `ocrm/wpo.py` and `ocrm/dpo.py`.
For faster training, consider replacing the `ocrm/ppo.py` step with TRL or verl; however, we do not currently support this here.
For Pythia, we use the TL;DR dataset provided by Huang et al. (https://huggingface.co/datasets/vwxyzjn/summarize_from_feedback_tldr_3_filtered_oai_preprocessing_1706381144) and add a Qwen-tokenized version (https://huggingface.co/datasets/johannesack/summarize_from_feedback_tldr_3_filtered_oai_preprocessingQWEN_1748920372).
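If you want to inspect the data before running the full pipeline, it can be pulled directly from the Hugging Face Hub. The snippet below is only a convenience example; the available splits and column names are whatever the dataset on the Hub defines, not what is shown here.

```python
# Peek at the TL;DR dataset used for the Pythia experiments.
# Splits and column names are defined by the dataset on the Hub.
from datasets import load_dataset

ds = load_dataset(
    "vwxyzjn/summarize_from_feedback_tldr_3_filtered_oai_preprocessing_1706381144"
)
print(ds)  # shows the available splits and their sizes
first_split = next(iter(ds))
print(ds[first_split][0].keys())  # inspect the available fields
```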
For Alpaca, we use a length-filtered version of the Alpaca-Farm dataset.
This dataset is generated by the `ocrm/alpaca_dataset.py` script and is available at https://huggingface.co/datasets/johannesack/alpaca_sft1738133725 for Pythia and https://huggingface.co/datasets/johannesack/alpaca_qwen_sft1738158738 for Qwen.
Our implementation is heavily based on Huang et al.'s https://github.com/vwxyzjn/summarize_from_feedback_details. Be sure to check out their great paper on reproducing the original OpenAI summarization experiments: https://arxiv.org/abs/2403.17031
```bibtex
@inproceedings{
  ackermann2025off_policy,
  title={Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback},
  author={Johannes Ackermann and Takashi Ishida and Masashi Sugiyama},
  booktitle={Conference on Language Modeling (COLM) 2025},
  year={2025},
  url={https://openreview.net/forum?id=0zxugBcgF5}
}
```