Gradient norm explosion after 40 steps using CISPO algorithm #31

@Linjunjie99

Description

I'm currently training models using the CISPO method, with both dense models (Qwen2.5-7B) and MoE models (Qwen3-30B-A3B).

During my experiments, I've encountered an issue where the gradient norm stays stable at around 0.2 early in training, but suddenly spikes into the tens after roughly 40 update steps. This explosion significantly degrades model performance for the rest of training.

My setup uses the DAPO algorithm from the verl framework, with the loss function replaced by CISPO's objective. After comparing against the Minimax M1 technical report, I noticed two potential differences in my implementation:

  1. I haven't modified the Adam optimizer parameters as described in the report
  2. I haven't implemented the unified mask-based clipping operation
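For reference, my loss modification is roughly the following sketch. This is my own reading of the CISPO objective, not code from verl or the M1 report; the `eps_high` value is a placeholder I picked, and in a real autograd framework the clipped ratio would be wrapped in a stop-gradient (e.g. `.detach()` in PyTorch):

```python
import numpy as np

def cispo_loss(logp_new, logp_old, advantages, mask, eps_high=0.2):
    """Value of a CISPO-style token-level objective (sketch, numpy only).

    CISPO clips the importance-sampling weight itself (and treats the
    clipped weight as a constant via stop-gradient) instead of clipping
    the policy update as PPO does. eps_high is a guessed placeholder,
    not a hyperparameter from the M1 report.
    """
    ratio = np.exp(logp_new - logp_old)                # token IS weight r_t
    ratio_clipped = np.minimum(ratio, 1.0 + eps_high)  # upper clip only
    # REINFORCE-style term: sg(clipped ratio) * A_t * log pi_theta(a_t)
    per_token = ratio_clipped * advantages * logp_new
    # masked mean over valid (non-padding) tokens
    return -(per_token * mask).sum() / np.maximum(mask.sum(), 1.0)
```

A lower clip `1 - eps_low` could be added symmetrically, but my understanding is that only the upper bound matters in practice, which is why I left it out here.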

I have a few questions regarding this:

  • Could these two omissions be the cause of the gradient norm explosion I'm experiencing?
  • The technical report mentions modified optimizer parameters with their motivations, but the gradient norm range in the report doesn't match what I'm observing. Is it still advisable to follow their parameter recommendations?
  • Was the mask-based clipping actually used in Minimax M1 training? I initially understood it as a theoretical unified framework rather than an actual implementation, but some related issues suggest it was used.
  • If mask-based clipping was indeed used, could you provide recommended hyperparameter examples for the clipping operation?
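For concreteness, here is how I currently understand the unified mask-based view: PPO-style clipping is rewritten as a per-token 0/1 mask that drops the gradient wherever the clipped objective would have been flat. This is my own reading, so the conditions may be off, and the epsilon values are DAPO-style placeholders rather than numbers from the report:

```python
import numpy as np

def clip_mask(ratio, advantages, eps_low=0.2, eps_high=0.28):
    """Token mask M_t for a unified, mask-based view of clipping (sketch).

    M_t = 0 drops a token whose gradient PPO clipping would have zeroed:
    positive advantage with the ratio above the upper bound, or negative
    advantage with the ratio below the lower bound. Everything else
    passes through with M_t = 1.
    """
    drop_pos = (advantages > 0) & (ratio > 1.0 + eps_high)  # clipped upside
    drop_neg = (advantages < 0) & (ratio < 1.0 - eps_low)   # clipped downside
    return np.where(drop_pos | drop_neg, 0.0, 1.0)
```

If this matches what was actually used in M1 training, concrete values for `eps_low`/`eps_high` would answer my last question above.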

Any guidance or insights into these questions would be greatly appreciated. Thank you for your help!
