
Conversation

@tastelikefeet
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @tastelikefeet, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive Ray support for Direct Preference Optimization (DPO), significantly expanding the distributed training capabilities. It involves a major refactoring of the underlying Ray integration mechanisms, providing more granular control over how different model components and training steps are distributed across Ray workers. This update not only adds DPO to the suite of Ray-supported features but also enhances the scalability and efficiency of existing pre-training and supervised fine-tuning workflows through a more robust and modular Ray infrastructure.

Highlights

  • Ray DPO Support: Direct Preference Optimization (DPO) training can now be distributed using Ray, with new example scripts and configurations provided to facilitate its use in multi-node and multi-GPU environments.
  • Generalized Ray Integration: The core Ray integration has been significantly refactored and enhanced, introducing a RayMixin and a worker registry. This allows for more flexible and robust distribution of various training components and steps across different Ray worker groups, such as pt:default, sft:default, rlhf:default, ref, reward, value, and teacher.
  • Distributed Model Preparation: Model components, including the main model, reference model, and reward model, are now prepared in a distributed manner using @RayHelper.function decorators. This change improves efficiency and scalability during the setup phase for complex multi-model training scenarios (a minimal illustrative sketch of this dispatch pattern follows this list).
  • Updated Documentation and Examples: Both Chinese and English documentation, along with example scripts, have been updated to reflect the new Ray DPO support and the refined Ray integration for pre-training (PT) and supervised fine-tuning (SFT).
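
The exact decorator and registry implementation lives in the PR diff; the following is only a minimal sketch, under assumed names, of how a RayHelper-style registry plus a function decorator can route a preparation step to a named worker group. The Worker class, its run method, and the group strings are illustrative placeholders, not the ms-swift API.

import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class Worker:
    """Stand-in actor that would hold one model replica for a worker group."""

    def __init__(self, group: str):
        self.group = group

    def run(self, fn, *args, **kwargs):
        # Execute the wrapped step inside this worker's process.
        return fn(*args, **kwargs)


class RayHelper:
    """Hypothetical registry mapping worker-group names to Ray actors."""

    _workers = {}

    @classmethod
    def register(cls, group: str):
        cls._workers[group] = Worker.remote(group)

    @classmethod
    def function(cls, group: str):
        """Decorator: run the decorated step on the actor registered for `group`."""

        def decorator(fn):
            def wrapper(*args, **kwargs):
                actor = cls._workers[group]
                return ray.get(actor.run.remote(fn, *args, **kwargs))

            return wrapper

        return decorator


RayHelper.register('rlhf:default')
RayHelper.register('ref')


@RayHelper.function(group='ref')
def prepare_ref_model(model_id):
    # Placeholder for the real model-loading logic on the 'ref' worker.
    return f'loaded {model_id} on the ref group'


print(prepare_ref_model('some-base-model'))
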
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request adds support for distributed DPO training using Ray, which is a significant new feature. The implementation is comprehensive, with changes to documentation, new examples, and substantial refactoring of the core training logic to accommodate a distributed setup with separate workers for policy and reference models. While the overall approach is solid, I've identified a couple of critical issues related to model initialization and usage in the distributed context that would prevent the feature from functioning as intended. I've also noted a minor issue regarding error handling. Addressing these points will be crucial for the stability and correctness of the new Ray DPO support.

Comment on lines +29 to +32
super().__init__(args)
self.reward_model = []
if self.args.rlhf_type == 'grpo':
    self.reward_template = []

critical

The initialization of self.reward_model and self.reward_template happens after super().__init__(args) is called. The super().__init__ call chain eventually triggers remote calls to methods like _prepare_reward_model, which attempt to use these attributes on the worker instances. Since they are not yet initialized on the workers when these remote methods are called, this will lead to an AttributeError.

To fix this, these attributes should be initialized before the super().__init__(args) call. Also, note that self.args is not available before super().__init__, so you should use the args from the __init__ method signature for the conditional initialization of self.reward_template.

Suggested change
- super().__init__(args)
- self.reward_model = []
- if self.args.rlhf_type == 'grpo':
-     self.reward_template = []
+ self.reward_model = []
+ if args.rlhf_type == 'grpo':
+     self.reward_template = []
+ super().__init__(args)
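
To see why the ordering matters, here is a standalone illustration of the failure mode (plain Python, not ms-swift code): when the base __init__ triggers a hook that reads attributes the subclass only assigns after super().__init__(args), the hook raises AttributeError; initializing the attributes first, using the local args, avoids it.

class Base:
    def __init__(self, args):
        self.args = args
        self._prepare_reward_model()      # analogous to the remote prepare call

    def _prepare_reward_model(self):
        pass


class Broken(Base):
    def __init__(self, args):
        super().__init__(args)            # the hook runs here...
        self.reward_model = []            # ...before this line ever executes

    def _prepare_reward_model(self):
        self.reward_model.append('rm')    # AttributeError: reward_model not set


class Fixed(Base):
    def __init__(self, args):
        self.reward_model = []            # initialize first, using `args` directly
        super().__init__(args)

    def _prepare_reward_model(self):
        self.reward_model.append('rm')


Fixed(args=None)                          # works
try:
    Broken(args=None)
except AttributeError as exc:
    print(exc)                            # 'Broken' object has no attribute 'reward_model'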

return (mini_batch, ), {}


class DPOTrainer(RLHFTrainerMixin, SwiftMixin, DataLoaderMixin, HFDPOTrainer):

critical

The DPOTrainer inherits from trl.DPOTrainer, which uses self.ref_model to compute reference log probabilities within its get_batch_logps method. In this distributed setup, self.ref_model is None on the default worker where the trainer runs, as the reference model resides on a separate ref worker. This will cause the trainer to incorrectly use the policy model for computing reference log probabilities.

While you've correctly introduced _compute_ref_log_probs to perform remote computation on the ref worker, it is not being called. To fix this, you should override the get_batch_logps method in your custom DPOTrainer to use self.compute_ref_log_probs when a separate reference model is specified. This will ensure the reference log probabilities are computed on the correct worker.
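
A schematic sketch of that suggestion is shown below. The RefWorker actor, its ref_log_probs method, and the trainer's constructor are assumptions for illustration; only compute_ref_log_probs and the separate 'ref' worker come from the review above, and the exact trl/ms-swift signatures are not reproduced here.

import ray


@ray.remote
class RefWorker:
    """Would hold the frozen reference model on the 'ref' worker group."""

    def ref_log_probs(self, batch):
        # Placeholder: the real worker would run the reference model's forward
        # pass and return per-sequence log probabilities.
        return [0.0 for _ in batch]


class RayDPOTrainer:  # stands in for the PR's DPOTrainer subclass
    def __init__(self, ref_worker=None):
        self.ref_model = None          # no local reference model on 'default'
        self.ref_worker = ref_worker   # Ray actor handle for the 'ref' group

    def compute_ref_log_probs(self, batch):
        if self.ref_worker is not None:
            # Route the reference forward pass to the remote 'ref' worker
            # instead of silently reusing the policy model.
            return ray.get(self.ref_worker.ref_log_probs.remote(batch))
        raise RuntimeError('no reference model available')


ray.init(ignore_reinit_error=True)
trainer = RayDPOTrainer(ref_worker=RefWorker.remote())
print(trainer.compute_ref_log_probs(['chosen', 'rejected']))  # [0.0, 0.0]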

try:
    ray.get(RayHelper._registry.clear.remote())
    ray.kill(RayHelper._registry)
except:  # noqa

medium

Using a bare except clause is generally discouraged as it can catch and hide unexpected system-exiting exceptions like SystemExit or KeyboardInterrupt, making it harder to debug issues. It's better to catch a more specific exception, such as Exception, to avoid unintentionally suppressing important errors during the teardown process.

Suggested change
- except:  # noqa
+ except Exception:  # noqa
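
The reasoning can be checked directly: KeyboardInterrupt and SystemExit derive from BaseException but not from Exception, so a bare except swallows them while except Exception lets them propagate.

print(issubclass(KeyboardInterrupt, Exception))      # False
print(issubclass(SystemExit, Exception))             # False
print(issubclass(KeyboardInterrupt, BaseException))  # True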

@tastelikefeet tastelikefeet changed the title Support ray dpo [WIP]Support ray dpo Nov 1, 2025
@tastelikefeet tastelikefeet changed the title [WIP]Support ray dpo [WIP][Script]Support ray dpo Nov 1, 2025
@tastelikefeet tastelikefeet changed the title [WIP][Script]Support ray dpo [WIP][Exp]Support ray dpo Nov 3, 2025
