Description
I'm currently training models using the CISPO method, with both dense models (Qwen2.5-7B) and MoE models (Qwen3-30B-A3B).
During my experiments, I've encountered an issue where the gradient norm stays stable at around 0.2 in the early training stages, but suddenly spikes to several tens after 40 update steps. This explosion significantly degrades model performance in subsequent training.
My setup uses the DAPO algorithm from the verl framework, with the loss function replaced by the CISPO loss (my modification is sketched after the list below). After comparing against the MiniMax-M1 technical report, I noticed two potential differences in my implementation:
- I haven't modified the Adam optimizer parameters as described in the report
- I haven't implemented the unified mask-based clipping operation
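For reference, here is roughly how I modified the policy loss. This is only a minimal sketch of my own patch, not verl's actual API: the function name `cispo_policy_loss` and the `eps_high` default are mine, and the clipping bounds are assumptions rather than values taken from the report.

```python
import torch

def cispo_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                      eps_high=2.0, eps_low=None):
    """CISPO-style token-level loss as I understand it from the MiniMax-M1 report.

    log_probs / old_log_probs: (batch, seq) token log-probs under the current
    and behavior policies; advantages: (batch, seq) broadcast sequence
    advantages; response_mask: 1 for response tokens, 0 for prompt/padding.
    """
    # Importance-sampling ratio r_t = pi_theta / pi_old per token
    ratio = torch.exp(log_probs - old_log_probs)

    # Clip the IS weight itself (upper bound only by default, as I read the
    # report) and stop-gradient it, so gradients flow only through log_probs.
    clipped = torch.clamp(ratio, max=1.0 + eps_high)
    if eps_low is not None:
        clipped = torch.clamp(clipped, min=1.0 - eps_low)
    weight = clipped.detach()

    # Unlike PPO-clip, every token keeps a (possibly clipped) non-zero weight,
    # so no token's gradient is dropped outright.
    per_token_loss = -weight * advantages * log_probs

    # Token-level mean over all valid response tokens in the batch
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```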
 
I have a few questions regarding this:
- Could these two omissions be the cause of the gradient norm explosion I'm experiencing?
- The technical report explains the modified optimizer parameters and the motivation behind them, but the gradient norm range given there doesn't match what I'm observing. Is it still advisable to follow those parameter recommendations?
- Was the mask-based clipping actually used in MiniMax-M1 training? I initially understood it as a theoretical unified framework rather than an actual implementation, but some related issues suggest it was used.
- If mask-based clipping was indeed used, could you share recommended hyperparameters for the clipping operation? (My current understanding of the mask is sketched after this list.)
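For context on the last two questions, this is my current understanding of the unified mask. It is only a sketch: the mask condition and the epsilon defaults reflect my own reading of the report, not confirmed values.

```python
import torch

def unified_clip_mask(ratio, advantages, eps_low=0.2, eps_high=0.2):
    """Token mask M_{i,t} as I understand the unified formulation: zero out a
    token's gradient only when the IS ratio has already moved past the trust
    region in the direction its advantage is pushing it. CISPO would correspond
    to keeping the mask at 1 everywhere. Epsilon values here are placeholders.
    """
    out_high = (advantages > 0) & (ratio > 1.0 + eps_high)
    out_low = (advantages < 0) & (ratio < 1.0 - eps_low)
    return (~(out_high | out_low)).to(ratio.dtype)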
 
Any guidance or insights into these questions would be greatly appreciated. Thank you for your help!