@crypdick crypdick commented Oct 30, 2025

Description

Demonstrates using RDT for RL post-training of an LLM on the GSM8K dataset.

Related issues

Companion PR to #57961

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>


@ray.remote(name=REGISTRY_NAME)
class RayObjectRefRegistry:
Contributor Author

This is my workaround for sending the RDT ObjectRef to the vLLM workers. collective_rpc() is unable to send the ObjectRef because of the way it serializes objects: https://gist.github.com/crypdick/8bd703085f5c8f8b2f4d2def58bac516
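
A rough sketch of the registry idea, for readers who have not opened the diff. The method names, the key format, and the boxing convention here are hypothetical and not taken from the PR; the only detail taken from the PR is that the class is declared as a named Ray actor with `@ray.remote(name=REGISTRY_NAME)`. The sketch is plain Python so that it runs without a cluster. The point is that only a short string key has to travel through `collective_rpc()`, while the reference itself is fetched from the registry on the worker side.

```python
class ObjectRefRegistry:
    """Key-value store for references that cannot travel inside an RPC payload.

    Hypothetical sketch. In the PR the class is declared with
    @ray.remote(name=REGISTRY_NAME) so that any worker can look it up by name.
    """

    def __init__(self) -> None:
        self._refs: dict[str, list] = {}

    def put(self, key: str, boxed_ref: list) -> None:
        # The reference arrives inside a one-element list. Ray resolves
        # ObjectRefs passed as top-level arguments to a remote call, so an
        # actor version would need this boxing to receive the ref itself.
        self._refs[key] = boxed_ref

    def pop(self, key: str) -> list:
        # Remove on read so the registry does not keep old weights alive.
        return self._refs.pop(key)


# Stand-in for an ObjectRef; a real one would come from the trainer.
fake_ref = object()

registry = ObjectRefRegistry()
registry.put("weights/step-1", [fake_ref])  # trainer side

key_sent_over_rpc = "weights/step-1"        # only this string crosses the RPC
(received_ref,) = registry.pop(key_sent_over_rpc)  # worker side
```

Popping on read is a design choice of this sketch, not something the PR states; a version that serves several workers per key would need `get` semantics plus explicit cleanup.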

    self.optim.step()
else:
    print(
        "[WARNING] Skipping optimizer step due to zero gradients - all samples likely have same reward",
Contributor Author

@crypdick crypdick Oct 30, 2025

Without this guard, the Generator collapses after the first training step and outputs nothing but end-of-sentence tokens: <s><s><s><s><s><s><s>. One simple fix is to use smoother rewards: instead of giving every wrong answer a reward of 0, give partial reward to answers that are closer to the target.
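
To illustrate why smoother rewards help, here is a minimal sketch. It assumes advantages are computed by subtracting the group mean reward from each sample's reward, which this thread does not confirm. `shaped_reward`, its `scale` parameter, and the example numbers are hypothetical and not taken from the PR.

```python
def binary_reward(predicted: float, target: float) -> float:
    """1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if predicted == target else 0.0


def shaped_reward(predicted: float, target: float, scale: float = 10.0) -> float:
    """1.0 for an exact match, decaying smoothly toward 0.0 as the error grows."""
    return 1.0 / (1.0 + abs(predicted - target) / scale)


def centered_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean from each reward (assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


target = 42.0
predictions = [40.0, 50.0, 100.0, 7.0]  # every sampled answer is wrong

binary = [binary_reward(p, target) for p in predictions]
shaped = [shaped_reward(p, target) for p in predictions]

# Identical rewards leave nothing to learn from: every advantage is zero.
print(centered_advantages(binary))
# Shaped rewards still rank the wrong answers, so the advantages differ.
print(centered_advantages(shaped))
```

With binary rewards, a batch in which every answer is wrong carries no signal, which is the case the skip-step warning guards against. The shaped version keeps a signal in that batch, at the cost of rewarding answers that are numerically close but still incorrect.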

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 14, 2025

for name, weight in state_dict.items():
    # FIXME: Qiaolin, remove deepcopy once the bug with sending the same weights multiple times is fixed.
    name_weight = (name, weight)
    name_weight = copy.deepcopy(name_weight)
Member

Do we still need to use deepcopy here?

Contributor Author

@Qiaolin-Yu It was necessary when I opened the PR; I'm not sure whether it has been fixed since then.

@github-actions github-actions bot added the unstale label and removed the stale label Nov 15, 2025