v1.1.3
This version should work with CoMLRL v1.1.3.
Changelog
- Remove the hard-coded code-level logging at each node. It used a large amount of VRAM, and we don't expect users to inspect these details during training.
- Change the default hyperparameters to follow the Dr. GRPO style and set the default learning rate to 2e-5. Remove the bandit external mode, since it is equivalent to magrpo in the single-turn setting.
- Clean up the code formatting and group related params together.
- Add MBPP dataset.
Plain fails 2; expert fails 1; level feedback has not failed yet.