SYCL: Add set_rows support for quantized types #14883


Merged — 7 commits merged on Jul 28, 2025

Conversation

@qnixsynapse (Collaborator) commented Jul 26, 2025

This change adds support for the GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and the BF16 type in the SYCL backend.

The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp.

This addresses part of the TODOs mentioned in the code.
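The copy kernels moved into cpy.hpp quantize source rows on the fly as they are written into the destination tensor. As a rough scalar illustration of the Q8_0 scheme (32 values per block sharing one scale; simplified here with a float scale rather than the fp16 scale ggml actually stores, and with hypothetical helper names), a quantize/dequantize round trip looks like this:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Simplified stand-in for ggml's block_q8_0: 32 values share one scale.
// (Real ggml stores the scale as fp16; float is used here for brevity.)
constexpr int QK8_0 = 32;
struct block_q8_0 {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values
};

// Quantize one row of floats into Q8_0 blocks.
// The row length must be a multiple of QK8_0.
std::vector<block_q8_0> quantize_row_q8_0(const std::vector<float> & x) {
    assert(x.size() % QK8_0 == 0);
    std::vector<block_q8_0> out(x.size() / QK8_0);
    for (size_t b = 0; b < out.size(); ++b) {
        float amax = 0.0f;  // absolute max within the block
        for (int i = 0; i < QK8_0; ++i) {
            amax = std::max(amax, std::fabs(x[b * QK8_0 + i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < QK8_0; ++i) {
            out[b].qs[i] = (int8_t) std::lround(x[b * QK8_0 + i] * id);
        }
    }
    return out;
}

// Dequantize Q8_0 blocks back to floats.
std::vector<float> dequantize_row_q8_0(const std::vector<block_q8_0> & blocks) {
    std::vector<float> out(blocks.size() * QK8_0);
    for (size_t b = 0; b < blocks.size(); ++b) {
        for (int i = 0; i < QK8_0; ++i) {
            out[b * QK8_0 + i] = blocks[b].d * blocks[b].qs[i];
        }
    }
    return out;
}
```

In the SYCL backend the per-block work runs in parallel across work-items; set_rows additionally reads a 64-bit row index per source row to select which destination row receives the quantized blocks.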

Please note: I have also added support for the BF16 type. I am aware that not all GPUs support it. If there is a way to disable it in device_supports_op without declaring a compiler definition such as GGML_SYCL_F16, please let me know. Until then, please don't merge this.

Performance comparison

| Model | Batch size | Test | t/s master (LLAMA_SET_ROWS=0) | t/s sycl/set_rows_q_n_bf16 (LLAMA_SET_ROWS=1) | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3 1.7B Q8_0 | 64 | pp1024 | 720.70 | 717.94 | 1.00 |
| qwen3 1.7B Q8_0 | 128 | pp1024 | 1328.80 | 1333.39 | 1.00 |
| qwen3 1.7B Q8_0 | 256 | pp1024 | 2367.32 | 2350.50 | 0.99 |
| qwen3 1.7B Q8_0 | 512 | pp1024 | 3806.25 | 3762.84 | 0.99 |
| qwen3 1.7B Q8_0 | 1024 | pp1024 | 3858.31 | 3798.22 | 0.98 |

Performance is nearly identical for quantized set_rows, with room for further improvement in the future.

Update: A block size of 256 gives the best result so far on an A750 GPU:

| Model | Batch size | Test | t/s master (LLAMA_SET_ROWS=0) | t/s sycl/set_rows_q_n_bf16 (LLAMA_SET_ROWS=1) | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3 1.7B Q8_0 | 64 | pp1024 | 719.77 | 719.82 | 1.00 |
| qwen3 1.7B Q8_0 | 128 | pp1024 | 1332.75 | 1336.66 | 1.00 |
| qwen3 1.7B Q8_0 | 256 | pp1024 | 2365.29 | 2348.92 | 0.99 |
| qwen3 1.7B Q8_0 | 512 | pp1024 | 3805.34 | 3782.11 | 0.99 |
| qwen3 1.7B Q8_0 | 1024 | pp1024 | 3854.83 | 3825.46 | 0.99 |
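The "block size" tuned here is the SYCL work-group size used when launching the kernel. SYCL's nd_range launch requires the global size to be a multiple of the work-group size, so the launch arithmetic pads the element count up and the kernel guards against out-of-range indices. A minimal sketch of that padding (helper names are illustrative, not the backend's actual API):

```cpp
#include <cassert>
#include <cstdint>

// Integer ceiling division: how many work-groups are needed to cover n elements.
constexpr int64_t ceil_div(int64_t n, int64_t d) {
    return (n + d - 1) / d;
}

// Round the global work size up to a multiple of the work-group size,
// as required by SYCL's nd_range launch model. Work-items whose global id
// is >= n_elements simply return early inside the kernel.
constexpr int64_t padded_global_size(int64_t n_elements, int64_t wg_size) {
    return ceil_div(n_elements, wg_size) * wg_size;
}
```

For example, launching over 1000 elements with a work-group size of 256 dispatches 4 work-groups (1024 work-items), of which the last 24 are masked off in the kernel body.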

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Jul 26, 2025
@Alcpz (Collaborator) commented Jul 28, 2025

The bfloat16 extension states that it's supported in all GPUs, with the caveat that unsupported GPUs emulate the behavior in software: https://github.com/intel/llvm/blob/27dab6ce45c073ffbe7706747d6feee80a94dd49/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc#overview

I'd say it's safe to merge from a usability perspective. I haven't seen any mechanism in DPC++ to discern at runtime whether bfloat16 is natively supported on the device, so I don't know how the performance implications of software emulation could be avoided.

@Rbiessy (Collaborator) left a review comment:

LGTM!

@qnixsynapse merged commit cd1fce6 into master on Jul 28, 2025
47 checks passed
@qnixsynapse deleted the sycl/set_rows_q_n_bf16 branch on July 28, 2025 at 15:02
BradHutchings added a commit to BradHutchings/Mmojo-Server that referenced this pull request Jul 28, 2025
SYCL: Add set_rows support for quantized types (ggml-org#14883)
3 participants