I trained the baseline on a single A100-40G using `./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base_occ.py 1`.
After 24 epochs, I tried `./tools/dist_test.py ./projects/configs/bevformer/bevformer_base_occ.py work_dirs/bevformer_base_occ/epoch_24.pth 1`.
After loading the checkpoint and evaluating for 6019 tasks, I saw memory grow from 18 GB to 42 GB, and then the run suddenly failed with `torch.distributed.elastic.multiprocessing.api: failed`.
How can I fix this?
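One common cause of this kind of steady memory growth is that the evaluation loop accumulates every per-sample prediction in host memory before computing metrics at the end. A usual workaround is to fold each batch into a running confusion matrix and discard the raw predictions immediately. The sketch below illustrates the idea only; the class count (`NUM_CLASSES = 18`) and function names are hypothetical and are not taken from this repository's evaluation code.

```python
import numpy as np

NUM_CLASSES = 18  # hypothetical number of occupancy classes


def update_confusion(conf, pred, gt):
    """Fold one batch into a running confusion matrix.

    Only the (NUM_CLASSES x NUM_CLASSES) matrix is kept between batches,
    so memory use stays constant instead of growing with the dataset.
    """
    idx = gt.astype(np.int64) * NUM_CLASSES + pred.astype(np.int64)
    conf += np.bincount(idx.ravel(), minlength=NUM_CLASSES ** 2).reshape(
        NUM_CLASSES, NUM_CLASSES
    )
    return conf


def mean_iou(conf):
    """Compute mIoU from an accumulated confusion matrix."""
    tp = np.diag(conf).astype(np.float64)
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))
```

In an actual test loop, each batch's predictions would be moved off the GPU (e.g. `pred.detach().cpu().numpy()`), passed through `update_confusion`, and then released, rather than being appended to a results list that the metric code consumes all at once.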