[WIP][Exp]Support ray dpo #6395
base: main
Conversation
…-ray # Conflicts: # swift/ray/resource_manager.py
Summary of Changes
Hello @tastelikefeet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive Ray support for Direct Preference Optimization (DPO), significantly expanding the distributed training capabilities. It involves a major refactoring of the underlying Ray integration mechanisms, providing more granular control over how different model components and training steps are distributed across Ray workers. This update not only adds DPO to the suite of Ray-supported features but also enhances the scalability and efficiency of existing pre-training and supervised fine-tuning workflows through a more robust and modular Ray infrastructure.
Highlights
Code Review
This pull request adds support for distributed DPO training using Ray, which is a significant new feature. The implementation is comprehensive, with changes to documentation, new examples, and substantial refactoring of the core training logic to accommodate a distributed setup with separate workers for policy and reference models. While the overall approach is solid, I've identified a couple of critical issues related to model initialization and usage in the distributed context that would prevent the feature from functioning as intended. I've also noted a minor issue regarding error handling. Addressing these points will be crucial for the stability and correctness of the new Ray DPO support.
super().__init__(args)
self.reward_model = []
if self.args.rlhf_type == 'grpo':
    self.reward_template = []
The initialization of self.reward_model and self.reward_template happens after super().__init__(args) is called. The super().__init__ call chain eventually triggers remote calls to methods like _prepare_reward_model, which attempt to use these attributes on the worker instances. Since they are not yet initialized on the workers when these remote methods are called, this will lead to an AttributeError.
To fix this, these attributes should be initialized before the super().__init__(args) call. Also, note that self.args is not available before super().__init__, so you should use the args from the __init__ method signature for the conditional initialization of self.reward_template.
Suggested change:

Original:
super().__init__(args)
self.reward_model = []
if self.args.rlhf_type == 'grpo':
    self.reward_template = []

Suggested:
self.reward_model = []
if args.rlhf_type == 'grpo':
    self.reward_template = []
super().__init__(args)
return (mini_batch, ), {}


class DPOTrainer(RLHFTrainerMixin, SwiftMixin, DataLoaderMixin, HFDPOTrainer):
The DPOTrainer inherits from trl.DPOTrainer, which uses self.ref_model to compute reference log probabilities within its get_batch_logps method. In this distributed setup, self.ref_model is None on the default worker where the trainer runs, as the reference model resides on a separate ref worker. This will cause the trainer to incorrectly use the policy model for computing reference log probabilities.
While you've correctly introduced _compute_ref_log_probs to perform remote computation on the ref worker, it is not being called. To fix this, you should override the get_batch_logps method in your custom DPOTrainer to use self.compute_ref_log_probs when a separate reference model is specified. This will ensure the reference log probabilities are computed on the correct worker.
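To make the suggestion concrete, below is a minimal sketch of one possible override inside this DPOTrainer. It assumes the remote helper added in this PR is named `_compute_ref_log_probs` and accepts the padded batch; the choice to hook `compute_ref_log_probs` (trl's entry point for reference log probabilities) rather than `get_batch_logps`, and the exact signature, are assumptions rather than the PR's final design.

```python
# Sketch only: a method override meant to live inside swift's DPOTrainer,
# which subclasses trl.DPOTrainer.
def compute_ref_log_probs(self, batch):
    # On the default worker the reference model lives on a separate ref worker,
    # so self.ref_model is None here. If the remote helper added in this PR is
    # available, delegate to it instead of letting trl reuse the policy model.
    if self.ref_model is None and hasattr(self, '_compute_ref_log_probs'):
        return self._compute_ref_log_probs(batch)
    # Otherwise keep the upstream trl behaviour.
    return super().compute_ref_log_probs(batch)
```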
try:
    ray.get(RayHelper._registry.clear.remote())
    ray.kill(RayHelper._registry)
except:  # noqa
Using a bare except clause is generally discouraged as it can catch and hide unexpected system-exiting exceptions like SystemExit or KeyboardInterrupt, making it harder to debug issues. It's better to catch a more specific exception, such as Exception, to avoid unintentionally suppressing important errors during the teardown process.
Suggested change:

Original:
except:  # noqa

Suggested:
except Exception:  # noqa
PR type
PR information
Write the detailed information that belongs to this PR.
Experiment results
Paste your experiment results here (if needed).