Open
Labels
bug: Something isn't working
Description
Prerequisite
- I have searched Issues and Discussions but cannot get the expected help.
- The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).
Environment
(Issue obvious from source)
Reproduces the problem - code sample
(Issue obvious from source)
Reproduces the problem - command or script
(Issue obvious from source)
Reproduces the problem - error message
(Issue obvious from source)
Additional information
In torch/nn/modules/batchnorm.py, the SyncBatchNorm.convert_sync_batchnorm() method copies over the training attribute: module_output.training = module.training.
The mmengine version is missing this line. However, it is present in the revert_sync_batchnorm() method right above.
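Without it, each converted layer is a freshly constructed SyncBatchNorm, which defaults to training mode regardless of the original layer's mode. This causes an NCCL timeout when the BN layers are kept in eval mode for fine-tuning while the rest of the model is in training mode. I figured this out from a similar issue/solution here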
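To make the missing line concrete, here is a minimal sketch of a recursive conversion that preserves the train/eval flag. It is not mmengine's code: the helper name convert_sync_batchnorm_keep_mode is hypothetical, and the body only follows the general pattern described above (build a SyncBatchNorm, copy parameters and running statistics, then copy training). The conversion itself needs no initialized process group, so it can be checked on CPU.

```python
import torch
from torch import nn
from torch.nn.modules.batchnorm import _BatchNorm


def convert_sync_batchnorm_keep_mode(module: nn.Module) -> nn.Module:
    """Hypothetical helper: replace BatchNorm layers with SyncBatchNorm
    while keeping each layer's train/eval mode."""
    module_output = module
    if isinstance(module, _BatchNorm) and not isinstance(module, nn.SyncBatchNorm):
        module_output = nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        # The line this issue is about: a new module starts with
        # training=True, so the original flag must be copied explicitly.
        module_output.training = module.training
    for name, child in module.named_children():
        module_output.add_module(name, convert_sync_batchnorm_keep_mode(child))
    return module_output


# Fine-tuning setup: model in training mode, BN layer frozen in eval mode.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model.train()
model[1].eval()

converted = convert_sync_batchnorm_keep_mode(model)
print(type(converted[1]).__name__, converted[1].training)
```

If the module_output.training assignment is dropped from this sketch, the converted layer reports training=True even though the original BatchNorm2d was in eval mode, which is the mismatch described in this issue.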