Stuck while training InternVL3.5-30B-A3B

Hi, thanks for open-sourcing this! I have a question I'd like help with.

I'm training InternVL3.5-30B-A3B with ZeRO-3 on 4 nodes with 8 H20 GPUs each, and the run hangs indefinitely. A screenshot of the log is below:

<img width="1007" height="652" alt="Image" src="https://github.com/user-attachments/assets/207fd4a0-23a3-4111-8c9d-9b9d2d7b344d" />

GPU utilization stays at 100% with about 10 GB of memory in use, so training clearly has not started:
<img width="984" height="602" alt="Image" src="https://github.com/user-attachments/assets/def4201b-d043-435b-8dd6-e8db41da1785" />

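In the same environment, the 4B dense model trains normally. Based on my experience with dense training, the training-step logs should start right after the log output above, but instead the job just keeps hanging.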

With or without packing, the run reaches the log output above and then hangs. How can I resolve this?
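In case it helps narrow down where each rank is stuck, here is a minimal sketch of a stack-dump helper built on Python's standard `faulthandler` module. The function name `enable_hang_dump`, the log file name, and the use of the `RANK` environment variable (set by launchers such as torchrun) are my own assumptions and are not part of the InternVL training scripts.

```python
import faulthandler
import os

# Keep the file object alive: faulthandler writes to its file descriptor later.
_trace_file = None


def enable_hang_dump(timeout_s: float = 600.0, log_dir: str = ".") -> str:
    """Dump the Python stacks of all threads every `timeout_s` seconds.

    Hypothetical helper: call it once near the top of the training entry
    point. Each rank writes to its own file so the stuck call site can be
    compared across ranks.
    """
    global _trace_file
    # Assumption: the launcher exports RANK; fall back to "0" otherwise.
    rank = os.environ.get("RANK", "0")
    path = os.path.join(log_dir, f"hang_trace_rank{rank}.log")
    _trace_file = open(path, "w")
    # repeat=True keeps dumping at every interval until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=_trace_file)
    return path
```

If the dumps from different ranks show different call sites (for example, some ranks waiting in a collective while others are elsewhere), that would suggest the ranks have diverged rather than that the job is merely slow. This only shows Python-level frames, not what is happening inside NCCL kernels.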


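Training the InternVL3_5-GPT-OSS-20B-A4B-Preview model works fine, using --use_custom_flash_attn True as in the provided script. The InternVL3.5-30B-A3B model that hangs also follows its corresponding script, which sets --use_custom_flash_attn False. Does the training script for the InternVL3.5-30B-A3B MoE model under the internvl3_5_qwen3 directory require any additional special configuration, such as hyperparameters or environment settings?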
