Find our paper at https://arxiv.org/abs/2507.15507 and visit our poster at COLM 2025!
The experiment is split into multiple files, which are integrated in the `ocrm_full_1B_tldr.sh` script.
The steps are:

- `ocrm/sft.py`: Train a supervised fine-tuning (SFT) model on the reward model dataset.
- `ocrm/sample_sftmodel.py`: Sample completions for the RM dataset training.
- `ocrm/gold_label_dataset.py`: Add synthetic preference labels to the RM dataset.
- `ocrm/reward.py`: Train a reward model on the RM dataset.
- `ocrm/ppo.py`: Train a PPO model on the reward model dataset.
- `ocrm/get_density_ratios.py`: Compute the density ratios for importance weighting. (A)
- `ocrm/reward.py`: Train an off-policy corrected reward model (OCRM) on the RM dataset (a sketch of the weighting idea follows this list).
- `ocrm/ppo.py`: Train a PPO model on the OCRM dataset.
- goto (A)
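To make the correction step concrete, here is a minimal sketch of an importance-weighted Bradley-Terry reward-model loss, which is the general idea behind reweighting the RM training with density ratios. The function, argument names, and the clipping value are illustrative assumptions made for this sketch; the exact definition, clipping, and normalization of the density ratios follow the paper and are implemented in `ocrm/reward.py` and `ocrm/get_density_ratios.py`, not in this snippet.

```python
# Illustrative sketch only (not the repo's code): an importance-weighted
# Bradley-Terry preference loss. All names and the clipping value are
# hypothetical; see ocrm/reward.py for the actual implementation.
import torch
import torch.nn.functional as F

def weighted_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor,
                             density_ratio: torch.Tensor,
                             clip: float = 10.0) -> torch.Tensor:
    """Bradley-Terry loss on preference pairs, reweighted by density ratios.

    reward_chosen, reward_rejected: (batch,) scalar RM outputs for each pair.
    density_ratio: (batch,) estimated ratio between the current policy and the
        policy that generated the RM data, e.g. as produced by
        ocrm/get_density_ratios.py (its exact definition follows the paper).
    """
    weights = density_ratio.clamp(max=clip)  # clamp for stability (assumption)
    nll = -F.logsigmoid(reward_chosen - reward_rejected)
    return (weights * nll).mean()
```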
We also provide an implementation of WPO in `ocrm/wpo.py`, based on Huang et al.'s DPO implementation.
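For orientation, the standard DPO objective that such implementations start from looks roughly as follows. This is a generic sketch with made-up names, not the code in `ocrm/wpo.py` or `ocrm/dpo.py`.

```python
# Generic sketch of the standard DPO objective (Rafailov et al., 2023);
# not this repo's code. Inputs are sequence log-probabilities of the
# chosen/rejected completions under the trained policy and a frozen
# reference policy.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```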
Docker is not strictly necessary, but we recommend using it for ease of reproducibility.
When using Docker, you can do the following:
```bash
git clone git@github.com:JohannesAck/OffPolicyCorrectedRewardModeling.git
cd OffPolicyCorrectedRewardModeling
docker build -f .devcontainer/Dockerfile -t my-ocrm-image .
docker run --gpus all --shm-size=32g -it --rm -v $(pwd):/app my-ocrm-image
cd /app
bash ocrm_full_DEBUG.sh
```
Alternatively, you can use poetry to install the dependencies. In that case, replace `/root/.local/bin/poetry run` with `poetry run` in the `.sh` scripts:
```bash
git clone git@github.com:JohannesAck/OffPolicyCorrectedRewardModeling.git
cd OffPolicyCorrectedRewardModeling
poetry install
poetry run pip install --upgrade setuptools packaging wheel pip ninja
poetry run pip install deepspeed==0.14.4
poetry run pip install --no-build-isolation --no-deps flash-attn
bash ocrm_full_DEBUG.sh
```
To check that your installation works, use `ocrm_full_DEBUG.sh`.
This script trains a 14M model for only a few training steps to make sure the whole process works; it runs on a 16 GB GPU in about 10 minutes.
To reproduce the Pythia-1B TL;DR experiments, use the script `ocrm_full_1B_tldr.sh`.
To reproduce the Qwen 2.5 1.5B Alpaca experiments, use the script `ocrm_full_1_5B_alpaca.sh`.
The above scripts also generate the SFT and RM datasets, which can be used to reproduce our WPO and DPO results with the scripts `ocrm/wpo.py` and `ocrm/dpo.py`.
For faster training, consider replacing the `ocrm/ppo.py` step with TRL or verl; however, we do not currently support this here.
For Pythia, we use the TL;DR dataset provided by Huang et al. (https://huggingface.co/datasets/vwxyzjn/summarize_from_feedback_tldr_3_filtered_oai_preprocessing_1706381144) and add a Qwen-tokenized version (https://huggingface.co/datasets/johannesack/summarize_from_feedback_tldr_3_filtered_oai_preprocessingQWEN_1748920372).
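If you want to inspect the data before running the full pipeline, it can be pulled directly from the Hugging Face Hub. The snippet below is only a convenience example; the available splits and column names are whatever the dataset on the Hub defines, not what is shown here.

```python
# Peek at the TL;DR dataset used for the Pythia experiments.
# Splits and column names are defined by the dataset on the Hub.
from datasets import load_dataset

ds = load_dataset(
    "vwxyzjn/summarize_from_feedback_tldr_3_filtered_oai_preprocessing_1706381144"
)
print(ds)  # shows the available splits and their sizes
first_split = next(iter(ds))
print(ds[first_split][0].keys())  # inspect the available fields
```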
For Alpaca, we use a length-filtered version of the Alpaca-Farm dataset.
This dataset is generated by the `ocrm/alpaca_dataset.py` script and is available at https://huggingface.co/datasets/johannesack/alpaca_sft1738133725 for Pythia and https://huggingface.co/datasets/johannesack/alpaca_qwen_sft1738158738 for Qwen.
Our implementation is heavily based on Huang et al.'s https://github.com/vwxyzjn/summarize_from_feedback_details. Be sure to check out their great paper on reproducing the original OpenAI summarization experiments: https://arxiv.org/abs/2403.17031
```bibtex
@inproceedings{
  ackermann2025off_policy,
  title={Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback},
  author={Johannes Ackermann and Takashi Ishida and Masashi Sugiyama},
  booktitle={Conference on Language Modeling (COLM) 2025},
  year={2025},
  url={https://openreview.net/forum?id=0zxugBcgF5}
}
```