Open
Description
I installed torch-ccl using the pip command shown in the README: python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu
However, when trying to import it, I get an error:
$ python3.11
Python 3.11.5 (main, Sep 06 2023, 11:21:05) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "/etc/pythonstart", line 7, in <module>
import readline
ModuleNotFoundError: No module named 'readline'
>>> import oneccl_bindings_for_pytorch
terminate called after throwing an instance of 'c10::Error'
what():
Mismatch in kernel C++ signatures
operator: c10d::allreduce_(Tensor[] tensors, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, Tensor? sparse_indices, int timeout) -> (Tensor[], __torch__.torch.classes.c10d.Work)
registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:10
kernel 1: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, c10::optional<at::Tensor> const&, long)
dispatch key: CPU
registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:501
kernel 2: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, long)
dispatch key: HIP
registered at /build/frameworks.ai.pytorch.torch-ccl/src/ProcessGroupCCL.cpp:89
Exception raised from registerKernel at /build/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:120 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9e77527a89 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f9e774e11d4 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::registerKernel(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x222 (0x7f9e78b35352 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::Dispatcher::registerImpl(c10::OperatorName, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x171 (0x7f9e78b2a191 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & + 0x38e (0x7f9e78b6465e in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x31be5 (0x7f9dd1d06be5 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #6: torch::detail::TorchLibraryInit::TorchLibraryInit(torch::Library::Kind, void (*)(torch::Library&), char const*, c10::optional<c10::DispatchKey>, char const*, unsigned int) + 0xf1 (0x7f9dd1d09f71 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #7: <unknown function> + 0x29842 (0x7f9dd1cfe842 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #8: <unknown function> + 0x111da (0x7f9e8f4061da in /lib64/ld-linux-x86-64.so.2)
frame #9: <unknown function> + 0x112f6 (0x7f9e8f4062f6 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x50 (0x7f9e8e95a11e in /lib64/libc.so.6)
frame #11: <unknown function> + 0x155d6 (0x7f9e8f40a5d6 in /lib64/ld-linux-x86-64.so.2)
frame #12: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #13: <unknown function> + 0x14e0b (0x7f9e8f409e0b in /lib64/ld-linux-x86-64.so.2)
frame #14: <unknown function> + 0x13b6 (0x7f9e8e6013b6 in /lib64/libdl.so.2)
frame #15: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #16: _dl_catch_error + 0x31 (0x7f9e8e95a21f in /lib64/libc.so.6)
frame #17: <unknown function> + 0x1ba5 (0x7f9e8e601ba5 in /lib64/libdl.so.2)
frame #18: dlopen + 0x73 (0x7f9e8e601481 in /lib64/libdl.so.2)
<omitting python frames>
frame #56: __libc_start_main + 0xef (0x7f9e8e8392bd in /lib64/libc.so.6)
frame #57: _start + 0x2c (0x560c8259e7aa in python3.11)
Aborted (core dumped)
Software Details:
- Python 3.11
- oneCCL 2021.11.1
- torch 2.1.0a0+cxx11.abi
- intel-extension-for-pytorch 2.1.10+xpu
- oneccl-bind-pt 2.0.100+gpu
- oneapi/release/2023.12.15.001
- intel_compute_runtime/release/stable-736.25
I suspect the package versions I have installed are not compatible with each other. Any suggestions would be helpful. Thanks!
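One way to narrow this down might be to compare the major.minor series of the Python packages listed above. The sketch below works on a working hypothesis, not a documented rule: that oneccl-bind-pt and intel-extension-for-pytorch should come from the same major.minor series as torch. The helper names (`major_minor`, `find_mismatches`, `installed_versions`) are made up for illustration, and the distribution names passed to `importlib.metadata` are assumed to match what pip installed.

```python
from importlib import metadata
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '2.1.0a0+cxx11.abi'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(versions):
    """Return the packages whose major.minor series differs from torch's.

    Assumption: matching torch's major.minor is the compatibility criterion.
    """
    base = major_minor(versions["torch"])
    return sorted(name for name, ver in versions.items() if major_minor(ver) != base)


def installed_versions(names):
    """Look up installed versions, skipping packages that are not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    # Distribution names as reported in the Software Details list (assumed).
    names = ["torch", "intel-extension-for-pytorch", "oneccl-bind-pt"]
    versions = installed_versions(names)
    for name, ver in versions.items():
        print(f"{name}: {ver}")
    if "torch" in versions:
        print("differs from torch series:", find_mismatches(versions))
    else:
        print("torch is not installed in this environment")
```

Fed the versions reported in this issue, the check flags only oneccl-bind-pt (2.0 series against torch 2.1). That would be consistent with the signature mismatch in the log, where the torch-side `allreduce_` kernel takes a `sparse_indices` argument and the kernel registered from ProcessGroupCCL.cpp does not.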