Network gIB Issues #22

@samueljmello

I can't run any of these tests. As far as I can tell, I have everything configured according to the prerequisites and documentation. Everything seems to work until the containers spin up and run. As best I can tell, the actual error is this:

torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.25.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Error: network gIB not found.

Does anyone have any advice? I'm trying to do this with FlexStart node pools and I'm not sure if there is a network configuration difference between DENSE provisioned nodes (mentioned in the readme) and FlexStart. Any help is appreciated.
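
The error message itself suggests rerunning with `NCCL_DEBUG=WARN` for details. A minimal sketch of setting that from a Python launcher, assuming the variable can be set in the environment of the process that initializes NCCL. The helper name `with_nccl_debug` is hypothetical and not part of this repository; `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables.

```python
import os


def with_nccl_debug(env=None, level="WARN", subsys=None):
    """Return a copy of `env` with NCCL debug logging enabled.

    Hypothetical helper for illustration only. `env` defaults to the
    current process environment; the original mapping is not modified.
    """
    allowed = {"VERSION", "WARN", "INFO", "TRACE"}
    if level not in allowed:
        raise ValueError(f"NCCL_DEBUG must be one of {sorted(allowed)}, got {level!r}")

    new_env = dict(os.environ if env is None else env)
    new_env["NCCL_DEBUG"] = level
    if subsys:
        # Optional filter, e.g. "INIT,NET" to focus on network plugin loading.
        new_env["NCCL_DEBUG_SUBSYS"] = subsys
    return new_env


if __name__ == "__main__":
    env = with_nccl_debug(level="WARN")
    # Pass `env` to subprocess.run(..., env=env) when launching the test,
    # or export the same variables before torch.distributed is initialized.
    print("NCCL_DEBUG =", env["NCCL_DEBUG"])
```

The variable has to be visible inside the container that actually runs the test, so in a Kubernetes setup it would more likely belong in the workload's container `env` than in a local shell.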
