Error in TP: RuntimeError: get_group_info: no group info associated with the group name #228

@zy-ning

Description

When I run ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth, it fails with the error: RuntimeError: get_group_info: no group info associated with the group name

Detailed error information:

W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
Using device=cuda
Loading model ...
Applying tensor parallel to model ...
Time to load model: 10.60 seconds
/root/serve/gpt-fast/tp.py:139: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead.
  attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 480, in <module>
[rank0]:     main(
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 401, in main
[rank0]:     y, metrics = generate(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 194, in generate
[rank0]:     next_token = prefill(model, prompt.view(batch_size, -1), input_pos, **sampling_kwargs).clone()
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 71, in prefill
[rank0]:     logits = model(mask, x, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 156, in forward
[rank0]:     x = layer(x, input_pos, freqs_cis, mask)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 175, in forward
[rank0]:     h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in inner
[rank0]:     hook_result = hook(self, args, result)
[rank0]:   File "/root/serve/gpt-fast/tp.py", line 139, in <lambda>
[rank0]:     attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 176, in all_reduce
[rank0]:     tensor = torch.ops._c10d_functional.all_reduce(self, reduceOp.lower(), group_name)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: get_group_info: no group info associated with the group name

Environment: uv virtual environment, Python 3.10, torch version 2.7.1+cu126
