[Bug] convert_sync_batchnorm missing 'training' attribute #1624

@collinmccarthy

Description

Prerequisite

Environment

(Issue obvious from source)

Reproduces the problem - code sample

(Issue obvious from source)

Reproduces the problem - command or script

(Issue obvious from source)

Reproduces the problem - error message

(Issue obvious from source)

Additional information

In torch/nn/modules/batchnorm.py, the SyncBatchNorm.convert_sync_batchnorm() method copies over the training attribute with module_output.training = module.training.

The mmengine version is missing this line, even though it is present in the revert_sync_batchnorm() method directly above it.

Without this, an NCCL timeout occurs when the BN layers are kept in eval mode for fine-tuning while the rest of the model is in training mode. I figured this out from a similar issue/solution here
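For reference, here is a minimal sketch of what the conversion looks like with the fix applied. This mirrors the structure of torch.nn.SyncBatchNorm.convert_sync_batchnorm rather than mmengine's exact code; the one line that matters for this issue is the copy of module.training:

```python
import torch
from torch import nn


def convert_sync_batchnorm(module, process_group=None):
    """Recursively replace BatchNorm layers with SyncBatchNorm.

    Sketch following torch.nn.SyncBatchNorm.convert_sync_batchnorm;
    not mmengine's actual implementation.
    """
    module_output = module
    if isinstance(module, nn.modules.batchnorm._BatchNorm):
        module_output = nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
            process_group,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        # The missing line: preserve train/eval mode on the converted layer,
        # so a BN layer frozen in eval mode stays in eval mode.
        module_output.training = module.training
    for name, child in module.named_children():
        module_output.add_module(name, convert_sync_batchnorm(child, process_group))
    del module
    return module_output


# A BN layer frozen in eval mode keeps its mode after conversion.
bn = nn.BatchNorm2d(8).eval()
sync_bn = convert_sync_batchnorm(bn)
assert sync_bn.training is False
```

Without the copy, the converted SyncBatchNorm defaults to training=True, so ranks disagree about whether to run the cross-process statistics sync, which is what leads to the NCCL timeout.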

Labels: bug (Something isn't working)