SYCL: Add set_rows support for quantized types #14883
This commit adds support for GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp. This addresses part of the TODOs mentioned in the code.
The bfloat16 extension states that it is supported on all GPUs, with the caveat that unsupported GPUs emulate the behavior in software: https://github.com/intel/llvm/blob/27dab6ce45c073ffbe7706747d6feee80a94dd49/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc#overview I'd say it's safe to merge from a usability perspective. I haven't seen any mechanism in DPC++ to discern at runtime whether bfloat16 is supported natively on the device, so I don't know how any performance implications could be avoided.
LGTM!
SYCL: Add set_rows support for quantized types (ggml-org#14883)
Please note: I have also added support for the BF16 type. I am aware that not all GPUs support it natively. If there is a way to disable it in device_supports_op without declaring a compiler definition such as GGML_SYCL_F16, please let me know. Until then, please don't merge this.
Performance comparison
Nearly identical for quantized set_rows, with room for further improvement in the future.
Update: A block size of 256 gives the best result so far on an A750 GPU: