Open
Description
I can't get any of these tests to run. As far as I can tell, I have everything configured according to the prerequisites and documentation. Everything seems to work until the containers spin up and start running. As best I can tell, the actual error is this:
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.25.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Error: network gIB not found.
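The error message itself suggests rerunning with NCCL_DEBUG=WARN. Below is a minimal Python sketch of how that could be set for a relaunch, and how the NCCL_* settings inside a container could be dumped so a FlexStart node can be compared against a DENSE one. The helper names are hypothetical. NCCL_DEBUG and NCCL_NET are standard NCCL environment variables, but whether NCCL_NET is what requests gIB in these containers is an assumption, not something the log confirms.

```python
import os


def with_nccl_debug(env, level="WARN"):
    """Return a copy of env with NCCL_DEBUG set; the input mapping is not modified."""
    updated = dict(env)
    updated["NCCL_DEBUG"] = level
    return updated


def nccl_settings(env):
    """Collect the NCCL_* variables from env, sorted by name, for side-by-side comparison."""
    return {name: env[name] for name in sorted(env) if name.startswith("NCCL_")}


if __name__ == "__main__":
    # Assumption: this runs inside the workload container before the job starts.
    env = with_nccl_debug(os.environ)
    for name, value in nccl_settings(env).items():
        print(f"{name}={value}")
```

Running this in a container on each node pool type and diffing the output would show whether the two environments request the same NCCL network.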
Does anyone have any advice? I'm using FlexStart node pools, and I'm not sure whether the network configuration differs between the DENSE-provisioned nodes mentioned in the README and FlexStart nodes. Any help is appreciated.