Error in TP: RuntimeError: get_group_info: no group info associated with the group name #228

When I run `ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth`, it fails with the following error: `RuntimeError: get_group_info: no group info associated with the group name`.


Detailed error information:
```
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0609 20:18:37.249000 1431440 torch/distributed/run.py:766] *****************************************
Using device=cuda
Loading model ...
Applying tensor parallel to model ...
Time to load model: 10.60 seconds
/root/serve/gpt-fast/tp.py:139: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead.
  attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 480, in <module>
[rank0]:     main(
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 401, in main
[rank0]:     y, metrics = generate(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 194, in generate
[rank0]:     next_token = prefill(model, prompt.view(batch_size, -1), input_pos, **sampling_kwargs).clone()
[rank0]:   File "/root/serve/gpt-fast/generate.py", line 71, in prefill
[rank0]:     logits = model(mask, x, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 156, in forward
[rank0]:     x = layer(x, input_pos, freqs_cis, mask)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/model.py", line 175, in forward
[rank0]:     h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in inner
[rank0]:     hook_result = hook(self, args, result)
[rank0]:   File "/root/serve/gpt-fast/tp.py", line 139, in <lambda>
[rank0]:     attn.register_forward_hook(lambda _module, _input, output: funcol.all_reduce(
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 176, in all_reduce
[rank0]:     tensor = torch.ops._c10d_functional.all_reduce(self, reduceOp.lower(), group_name)
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: get_group_info: no group info associated with the group name
```
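
The trace above is mostly `torch.nn.Module` call plumbing. As a reading aid, here is a minimal sketch that filters a traceback down to the project's own frames. The helper name `project_frames` and the `site-packages` exclusion rule are my own assumptions for illustration, not part of gpt-fast or PyTorch.

```python
import re

# Matches the 'File "...", line N, in name' part of a traceback line,
# regardless of any "[rank0]:" prefix that torchrun adds.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def project_frames(traceback_text, exclude=("site-packages",)):
    """Return (path, line, function) for frames whose path has no excluded marker.

    Hypothetical helper: the exclusion rule is a simple substring check.
    """
    frames = []
    for match in FRAME_RE.finditer(traceback_text):
        path = match.group("path")
        if any(marker in path for marker in exclude):
            continue  # skip library frames installed in the venv
        frames.append((path, int(match.group("line")), match.group("func")))
    return frames


# Three lines copied from the log above.
sample = '''
[rank0]:   File "/root/serve/gpt-fast/model.py", line 175, in forward
[rank0]:   File "/root/serve/gpt-fast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in inner
[rank0]:   File "/root/serve/gpt-fast/tp.py", line 139, in <lambda>
'''

for path, line, func in project_frames(sample):
    print(f"{path}:{line} in {func}")
```

Applied to the full log, the last project frame is the forward hook registered at `tp.py` line 139, which is the same line the `FutureWarning` points to.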

Environment: uv virtual environment, torch version 2.7.1+cu126.
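
For anyone comparing environments, here is a minimal sketch of how the installed versions can be collected from inside the venv. It uses only the standard library; the function names `package_version` and `environment_report` are hypothetical, chosen for this example.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("torch",)):
    """Collect the Python version and the versions of the given packages.

    Hypothetical helper: the default package list is an assumption; extend as needed.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        report[name] = package_version(name) or "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

It reads installed package metadata rather than importing `torch`, so it also works when the import itself is broken.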
