[Docs] Add RL example for RDT, vLLM, FSDP2, and GRPO. #58314

base: master

Conversation
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
```python
@ray.remote(name=REGISTRY_NAME)
class RayObjectRefRegistry:
```
This is my workaround for sending the RDT ObjectRef to the vLLM workers. collective_rpc() is unable to send the ObjectRef because of the way it serializes objects: https://gist.github.com/crypdick/8bd703085f5c8f8b2f4d2def58bac516
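For context, a minimal sketch of the named-actor registry pattern, with assumed key and method names (`put`/`get` and `"rdt_weights"` are illustrations for this sketch, not the PR's actual API):

```python
import ray

REGISTRY_NAME = "objectref_registry"  # assumed name for this sketch

@ray.remote(name=REGISTRY_NAME)
class RayObjectRefRegistry:
    """Named actor that holds ObjectRefs so other processes can fetch them by key."""

    def __init__(self):
        self._refs = {}

    def put(self, key, wrapped_ref):
        # The ref arrives wrapped in a list: Ray resolves top-level ObjectRef
        # arguments to their values, but passes refs nested inside containers as-is.
        self._refs[key] = wrapped_ref[0]

    def get(self, key):
        # Returning an ObjectRef from an actor method hands the caller the ref
        # itself rather than the underlying value.
        return self._refs[key]

ray.init()

# Producer side: stash the ref under a well-known key.
registry = RayObjectRefRegistry.remote()
ref = ray.put("stand-in for the real RDT ObjectRef")
ray.get(registry.put.remote("rdt_weights", [ref]))

# Consumer side (e.g. a vLLM worker): look up the named actor and fetch the
# ref directly, sidestepping collective_rpc() serialization entirely.
fetched_ref = ray.get(ray.get_actor(REGISTRY_NAME).get.remote("rdt_weights"))
payload = ray.get(fetched_ref)
```

The list wrapping matters: Ray resolves top-level ObjectRef arguments to their values before invoking a method, but passes refs nested inside containers through unchanged, which is what lets the registry hold the ref itself.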
```python
    self.optim.step()
else:
    print(
        "[WARNING] Skipping optimizer step due to zero gradients - all samples likely have same reward",
    )
```
Without this, after the first training step the Generator state collapses and outputs nothing but end-of-sentence tokens: `<s><s><s><s><s><s><s>`. One simple way to fix this is to give smoother rewards, e.g. instead of giving all wrong answers 0 reward, give a bit more reward when the answer is closer to the target.
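A minimal sketch of that reward-shaping idea for GSM8K-style numeric answers (the function name and decay shape are assumptions for illustration, not part of the PR):

```python
import math

def shaped_reward(predicted: float, target: float) -> float:
    """Give partial credit for near misses instead of a flat 0 for every
    wrong answer, so advantages don't vanish when a whole group is wrong."""
    rel_err = abs(predicted - target) / max(abs(target), 1.0)
    return math.exp(-rel_err)  # 1.0 on an exact match, decaying with error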
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
```python
for name, weight in state_dict.items():
    # FIXME: Qiaolin, remove deepcopy once the bug with sending the same weights multiple times is fixed.
    name_weight = (name, weight)
    name_weight = copy.deepcopy(name_weight)
```
Do we still need to use deepcopy here?
@Qiaolin-Yu It was necessary at the time I opened the PR; I'm not sure whether it has been fixed since then.
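If a copy does turn out to still be necessary, a lighter-weight alternative worth considering (a suggestion on my part, not what the PR does) is cloning only the tensor rather than deepcopying the whole tuple:

```python
# Hypothetical alternative: clone just the tensor instead of deepcopying the
# (name, weight) tuple; detach() drops autograd history and clone() copies
# the storage on the same device.
name_weight = (name, weight.detach().clone())
```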
Description
Demonstrates RDT (Ray Direct Transport) in RL post-training of an LLM on the GSM8K dataset.
Related issues
Companion PR to #57961