Cross-posting [this issue](https://github.com/intel/intel-extension-for-pytorch/issues/599) from `ipex`, in case the `torch-ccl` team is not aware of it.

Key issues:

* Compute and collective communications do not overlap on Intel GPU devices.
* Collectives block the host thread, rather than launching a kernel and returning immediately (as they do on NVIDIA devices).

The PyTorch profiler traces below highlight both issues (copied from the other thread).

## A100 Trace

<img width="1491" alt="nvidia_a100_trace" src="https://github.com/intel/torch-ccl/assets/44747910/f86b7311-1734-4091-b8f4-4d2f04ed4e81">

Non-blocking kernel launch and comms/compute overlap.

## Intel Max 1550 Trace

<img width="1491" alt="intel_1550_trace" src="https://github.com/intel/torch-ccl/assets/44747910/08bafa4a-e1d6-407f-a0c8-7952feecf0b4">

Blocking kernel launch and no comms/compute overlap.

See the other thread for more details.
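For anyone who wants a quick host-side check without reading profiler traces, below is a minimal timing sketch (not from the original issue). It times only how long `dist.all_reduce` takes to *return to the host*: an asynchronous launch (the NCCL behaviour) returns in well under a millisecond, while a blocking launch returns only after the collective itself completes. The `BACKEND` env-var switch and the `64 * 1024 * 1024`-element tensor size are my own conventions; it assumes a `torchrun` launch and, for Intel GPUs, that `oneccl_bindings_for_pytorch` is installed.

```python
import os
import time

import torch
import torch.distributed as dist

# Backend/device pair: "nccl"/"cuda" on NVIDIA, "ccl"/"xpu" on Intel GPUs.
# BACKEND is a convention for this sketch, not part of the original report.
BACKEND = os.environ.get("BACKEND", "ccl")
DEVICE = "xpu" if BACKEND == "ccl" else "cuda"

if BACKEND == "ccl":
    import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)


def main():
    dist.init_process_group(backend=BACKEND)
    rank = dist.get_rank()
    x = torch.randn(64 * 1024 * 1024, device=f"{DEVICE}:{rank}")

    # Warm-up collective so one-time communicator setup is not measured;
    # assumes torch.cuda.synchronize / torch.xpu.synchronize is available.
    dist.all_reduce(x)
    getattr(torch, DEVICE).synchronize()

    # Time only the host-side call, not the collective itself.
    t0 = time.perf_counter()
    dist.all_reduce(x)
    host_ms = (time.perf_counter() - t0) * 1e3

    if rank == 0:
        # On A100/NCCL this should print a sub-millisecond time; the traces
        # above suggest it approaches the full collective latency on Max 1550.
        print(f"all_reduce returned to the host after {host_ms:.2f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=2 check_blocking.py` on a multi-GPU node, once per backend, and compare the printed host-side times.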