v1.1.3

LovelyBuggies released this 05 Oct 00:46

d775bec

This version should work with CoMLRL v1.1.3.

Changelog

Remove the hard-coded code-level logging at each node: it cost a large amount of VRAM, and we don't expect users to inspect those details during training.
Change the default hyperparameter values to follow the Dr. GRPO style and set the default learning rate to 2e-5. Drop the bandit external mode, since it is equivalent to magrpo in the single-turn setting.
Clean up the code formatting and group closely related params together.
Add MBPP dataset.
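
The changelog does not list which defaults changed. As a rough illustration of what "Dr. GRPO style" usually refers to, the sketch below contrasts group-relative advantages with and without dividing by the group's reward standard deviation (Dr. GRPO drops that division). The function name `group_advantages` and the `normalize_std` flag are hypothetical and are not taken from CoMLRL.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=False, eps=1e-8):
    """Group-relative advantages for one prompt's sampled completions.

    Illustrative only; this is not the CoMLRL implementation.
    normalize_std=False: subtract the group mean only (Dr. GRPO style).
    normalize_std=True:  also divide by the group's standard deviation
                         (GRPO style; population std used here for simplicity).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [c / scale for c in centered]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # made-up rewards for four completions
    print(group_advantages(rewards))
    print(group_advantages(rewards, normalize_std=True))
```

Without the division, the advantage magnitude tracks the raw reward gap, so groups with nearly identical rewards are not scaled up.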

Plain fails 2; expert fails 1; level feedback has not failed yet.