Import error after building with pip #59

@suyashbakshi

I built torch-ccl using the pip command shown in the README: python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu

However, when trying to import it, I get an error:

$ python3.11
Python 3.11.5 (main, Sep 06 2023, 11:21:05) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/etc/pythonstart", line 7, in <module>
    import readline
ModuleNotFoundError: No module named 'readline'

>>> import oneccl_bindings_for_pytorch
terminate called after throwing an instance of 'c10::Error'
  what():
Mismatch in kernel C++ signatures
  operator: c10d::allreduce_(Tensor[] tensors, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, Tensor? sparse_indices, int timeout) -> (Tensor[], __torch__.torch.classes.c10d.Work)
    registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:10
  kernel 1: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, c10::optional<at::Tensor> const&, long)
    dispatch key: CPU
    registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:501
  kernel 2: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, long)
    dispatch key: HIP
    registered at /build/frameworks.ai.pytorch.torch-ccl/src/ProcessGroupCCL.cpp:89

Exception raised from registerKernel at /build/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:120 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9e77527a89 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f9e774e11d4 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::registerKernel(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x222 (0x7f9e78b35352 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::Dispatcher::registerImpl(c10::OperatorName, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x171 (0x7f9e78b2a191 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & + 0x38e (0x7f9e78b6465e in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x31be5 (0x7f9dd1d06be5 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #6: torch::detail::TorchLibraryInit::TorchLibraryInit(torch::Library::Kind, void (*)(torch::Library&), char const*, c10::optional<c10::DispatchKey>, char const*, unsigned int) + 0xf1 (0x7f9dd1d09f71 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #7: <unknown function> + 0x29842 (0x7f9dd1cfe842 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #8: <unknown function> + 0x111da (0x7f9e8f4061da in /lib64/ld-linux-x86-64.so.2)
frame #9: <unknown function> + 0x112f6 (0x7f9e8f4062f6 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x50 (0x7f9e8e95a11e in /lib64/libc.so.6)
frame #11: <unknown function> + 0x155d6 (0x7f9e8f40a5d6 in /lib64/ld-linux-x86-64.so.2)
frame #12: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #13: <unknown function> + 0x14e0b (0x7f9e8f409e0b in /lib64/ld-linux-x86-64.so.2)
frame #14: <unknown function> + 0x13b6 (0x7f9e8e6013b6 in /lib64/libdl.so.2)
frame #15: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #16: _dl_catch_error + 0x31 (0x7f9e8e95a21f in /lib64/libc.so.6)
frame #17: <unknown function> + 0x1ba5 (0x7f9e8e601ba5 in /lib64/libdl.so.2)
frame #18: dlopen + 0x73 (0x7f9e8e601481 in /lib64/libdl.so.2)
<omitting python frames>
frame #56: __libc_start_main + 0xef (0x7f9e8e8392bd in /lib64/libc.so.6)
frame #57: _start + 0x2c (0x560c8259e7aa in python3.11)

Aborted (core dumped)

Software Details:

  • Python 3.11
  • oneCCL 2021.11.1
  • torch 2.1.0a0+cxx11.abi
  • intel-extension-for-pytorch 2.1.10+xpu
  • oneccl-bind-pt 2.0.100+gpu
  • oneapi/release/2023.12.15.001
  • intel_compute_runtime/release/stable-736.25

I suspect the package versions I have installed are not compatible with each other. Any suggestions would be helpful. Thanks!
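One way to check for such a mismatch without triggering the abort is to read the installed versions from package metadata instead of importing the bindings. The following is a minimal sketch, not something taken from the README: it assumes that torch, intel-extension-for-pytorch, and oneccl-bind-pt are expected to share the same major.minor version, and the helper names (`major_minor`, `installed_major_minor`, `has_mismatch`) are hypothetical. The distribution names are taken from the list above and may need adjusting for a given environment.

```python
import re
from importlib import metadata


def major_minor(version):
    """Return (major, minor) parsed from a version string, or None if it does not parse."""
    m = re.match(r"(\d+)\.(\d+)", version)
    return (int(m.group(1)), int(m.group(2))) if m else None


def installed_major_minor(packages):
    """Map each distribution name to its (major, minor), or None if not installed.

    Only package metadata is read; nothing is imported, so the
    dispatcher registration that aborts on import never runs.
    """
    result = {}
    for name in packages:
        try:
            result[name] = major_minor(metadata.version(name))
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


def has_mismatch(versions):
    """True if the installed packages do not all share one (major, minor)."""
    found = {v for v in versions.values() if v is not None}
    return len(found) > 1


if __name__ == "__main__":
    # Assumed distribution names, taken from the "Software Details" list.
    names = ["torch", "intel-extension-for-pytorch", "oneccl-bind-pt"]
    versions = installed_major_minor(names)
    for name, mm in versions.items():
        print(name, mm if mm is not None else "not installed")
    if has_mismatch(versions):
        print("major.minor versions differ across the installed packages")
```

With the versions listed above, torch 2.1.0a0 and intel-extension-for-pytorch 2.1.10 both parse to (2, 1) while oneccl-bind-pt 2.0.100 parses to (2, 0), so the check would flag a difference. Whether that difference is the actual cause of the signature mismatch is still only a suspicion.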
