Description
I'm currently training models using the CISPO method, with both dense models (Qwen2.5-7B) and MoE models (Qwen3-30B-A3B).
During my experiments, I've encountered an issue where the gradient norm stays stable at around 0.2 in the early training stages, but suddenly spikes to several tens after 40 update steps. This explosion significantly degrades model performance in subsequent training.
My setup uses the DAPO algorithm from the verl framework, with the loss function replaced by the CISPO loss (my modification is sketched after the list below). After comparing against the MiniMax-M1 technical report, I noticed two potential differences in my implementation:
- I haven't modified the Adam optimizer parameters as described in the report
- I haven't implemented the unified mask-based clipping operation
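For reference, here is roughly how I modified the policy loss. This is only a minimal sketch of my own patch, not verl's actual API: the function name `cispo_policy_loss` and the `eps_high` default are mine, and the clipping bounds are assumptions rather than values taken from the report.

```python
import torch

def cispo_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                      eps_high=2.0, eps_low=None):
    """CISPO-style token-level loss as I understand it from the MiniMax-M1 report.

    log_probs / old_log_probs: (batch, seq) token log-probs under the current
    and behavior policies; advantages: (batch, seq) broadcast sequence
    advantages; response_mask: 1 for response tokens, 0 for prompt/padding.
    """
    # Importance-sampling ratio r_t = pi_theta / pi_old per token
    ratio = torch.exp(log_probs - old_log_probs)

    # Clip the IS weight itself (upper bound only by default, as I read the
    # report) and stop-gradient it, so gradients flow only through log_probs.
    clipped = torch.clamp(ratio, max=1.0 + eps_high)
    if eps_low is not None:
        clipped = torch.clamp(clipped, min=1.0 - eps_low)
    weight = clipped.detach()

    # Unlike PPO-clip, every token keeps a (possibly clipped) non-zero weight,
    # so no token's gradient is dropped outright.
    per_token_loss = -weight * advantages * log_probs

    # Token-level mean over all valid response tokens in the batch
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```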
 
I have a few questions regarding this:
- Could these two omissions be the cause of the gradient norm explosion I'm experiencing?
- The technical report explains the modified optimizer parameters and the motivation behind them, but the gradient norm range given there doesn't match what I'm observing. Is it still advisable to follow those parameter recommendations?
- Was the mask-based clipping actually used in MiniMax-M1 training? I initially understood it as a theoretical unified framework rather than an actual implementation, but some related issues suggest it was used.
- If mask-based clipping was indeed used, could you share recommended hyperparameters for the clipping operation? (My current understanding of the mask is sketched after this list.)
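For context on the last two questions, this is my current understanding of the unified mask. It is only a sketch: the mask condition and the epsilon defaults reflect my own reading of the report, not confirmed values.

```python
import torch

def unified_clip_mask(ratio, advantages, eps_low=0.2, eps_high=0.2):
    """Token mask M_{i,t} as I understand the unified formulation: zero out a
    token's gradient only when the IS ratio has already moved past the trust
    region in the direction its advantage is pushing it. CISPO would correspond
    to keeping the mask at 1 everywhere. Epsilon values here are placeholders.
    """
    out_high = (advantages > 0) & (ratio > 1.0 + eps_high)
    out_low = (advantages < 0) & (ratio < 1.0 - eps_low)
    return (~(out_high | out_low)).to(ratio.dtype)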
 
Any guidance or insights into these questions would be greatly appreciated. Thank you for your help!