MLA RoPE + quantization fused kernel: shape generalization for MHA / GQA #1924

kahyunnam · 2025-10-13T22:24:24Z

📌 Description

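Generalize the existing MLA RoPE + quantization fused kernels so they also handle GQA and MHA problem shapes (more than one K head, and a no-RoPE head dimension other than MLA's 512).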
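One way to read "shape generalization" is that head counts and head dimensions come from the tensor shapes instead of being fixed to MLA's 128 Q heads / 1 K head / 64+512 dims. The hypothetical helper below sketches that bookkeeping in plain Python. The 3-D (tokens, heads, dim) layout and every check in it are assumptions for illustration, not the kernel's actual validation logic.

```python
from typing import NamedTuple, Tuple

Shape = Tuple[int, ...]


class RopeQuantizeDims(NamedTuple):
    num_tokens: int
    num_q_heads: int
    num_k_heads: int
    rope_dim: int
    no_rope_dim: int


def infer_rope_quantize_dims(q_rope: Shape, k_rope: Shape,
                             q_nope: Shape, k_nope: Shape) -> RopeQuantizeDims:
    """Hypothetical helper: derive problem dims from (tokens, heads, dim) shapes.

    Assumption: an MLA-style 2-D K tensor would first be viewed as
    (tokens, 1, dim), so MLA, GQA and MHA all flow through the same path.
    """
    for name, shape in (("q_rope", q_rope), ("k_rope", k_rope),
                        ("q_nope", q_nope), ("k_nope", k_nope)):
        if len(shape) != 3:
            raise ValueError(f"{name} must be 3-D, got {shape}")
    num_tokens, num_q_heads, rope_dim = q_rope
    _, num_k_heads, k_rope_dim = k_rope
    # Every input must describe the same batch of tokens.
    if {q_rope[0], k_rope[0], q_nope[0], k_nope[0]} != {num_tokens}:
        raise ValueError("all inputs must have the same number of tokens")
    if k_rope_dim != rope_dim:
        raise ValueError("Q and K must share the RoPE dimension")
    # The RoPE and no-RoPE halves of a head must agree on the head count.
    if q_nope[1] != num_q_heads or k_nope[1] != num_k_heads:
        raise ValueError("RoPE and no-RoPE parts must have matching head counts")
    if q_nope[2] != k_nope[2]:
        raise ValueError("Q and K must share the no-RoPE dimension (assumed)")
    # Attention-level sanity check (assumed): each K head serves a whole group of Q heads.
    if num_q_heads % num_k_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_k_heads")
    return RopeQuantizeDims(num_tokens, num_q_heads, num_k_heads, rope_dim, q_nope[2])


# Shapes mirroring the MLA and GQA benchmark configurations, with 16 tokens.
mla = infer_rope_quantize_dims((16, 128, 64), (16, 1, 64), (16, 128, 512), (16, 1, 512))
gqa = infer_rope_quantize_dims((16, 32, 64), (16, 8, 64), (16, 32, 64), (16, 8, 64))
print(mla)
print(gqa)
```

MHA is simply the case `num_k_heads == num_q_heads`, so it needs no separate branch.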

🔍 Test Results

pytest -v tests/attention/test_rope.py::test_generalized_rope_quantize

============================================================================= 312 passed in 3.93s ==============================================================================

python benchmarks/bench_rope_quantize_fp8.py

Running MLA benchmark...
Running GQA benchmark...
Running MHA benchmark...

=== Summary Table ===
Tokens   MLA (ms)   GQA (ms)   MHA (ms)
----------------------------------------
1        0.00225    0.00174    0.00164
2        0.00225    0.00164    0.00174
4        0.00236    0.00164    0.00174
8        0.00256    0.00174    0.00184
16       0.00287    0.00184    0.00195
32       0.00440    0.00195    0.00215
64       0.00748    0.00215    0.00246
128      0.01270    0.00256    0.00307
256      0.02304    0.00348    0.00461
384      0.03799    0.00440    0.00594
512      0.17039    0.00532    0.00748
768      0.28539    0.00748    0.00993

Configuration details:
  MLA: 128 Q heads, 1 K head, 64+512 dims
  GQA: 32 Q heads, 8 K heads, 64+64 dims
  MHA: 32 Q heads, 32 K heads, 64+64 dims

Plot files saved to current directory:
  mla-rope-benchmark.png
  gqa-rope-benchmark.png
  mha-rope-benchmark.png
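A rough way to relate the summary table to the configuration details is to count how many Q and K elements each token carries. The sketch below does that arithmetic; it assumes each element is read once as a 16-bit value and written once as FP8, and it ignores `cos_sin_cache` and `pos_ids` traffic, so it is a proxy rather than a model of the kernel.

```python
# Per-token tensor footprint of the three benchmark configurations.
# Tuple layout: (num_q_heads, num_k_heads, rope_dim, no_rope_dim)
CONFIGS = {
    "MLA": (128, 1, 64, 512),
    "GQA": (32, 8, 64, 64),
    "MHA": (32, 32, 64, 64),
}


def elements_per_token(q_heads: int, k_heads: int, rope_dim: int, no_rope_dim: int) -> int:
    """Q and K elements (RoPE + no-RoPE parts) processed for one token."""
    return (q_heads + k_heads) * (rope_dim + no_rope_dim)


def bytes_per_token(q_heads: int, k_heads: int, rope_dim: int, no_rope_dim: int,
                    in_bytes: int = 2, out_bytes: int = 1) -> int:
    """Assumes one 16-bit read and one FP8 write per element; ignores cos/sin and pos_ids."""
    n = elements_per_token(q_heads, k_heads, rope_dim, no_rope_dim)
    return n * (in_bytes + out_bytes)


for name, cfg in CONFIGS.items():
    print(f"{name}: {elements_per_token(*cfg)} elements, {bytes_per_token(*cfg)} bytes per token")
```

Under these assumptions MLA processes 74,304 elements per token, about 14.5x GQA's 5,120, while MHA processes 8,192 (1.6x GQA). That ordering matches the table at larger token counts, where the kernels are more likely to be throughput-bound; at 1 to 16 tokens the times are nearly flat and launch overhead likely dominates.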

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).


csrc/rope.cu

pavanimajety · 2025-10-16T00:43:26Z

@nvpohanh for another set of eyes.

pavanimajety · 2025-10-16T00:44:16Z

csrc/rope.cu

-                       TensorView q_nope_out, TensorView k_nope_out, TensorView cos_sin_cache,
-                       TensorView pos_ids, double quant_scale_q, double quant_scale_kv,
-                       bool interleave) {
+void rope_quantize(TensorView q_rope_in, TensorView k_rope_in, TensorView q_nope_in,


A docstring explaining what the function does would be helpful here.

pavanimajety · 2025-10-16T00:45:47Z

benchmarks/bench_rope_quantize_fp8.py

+if __name__ == "__main__":
+    # Run all benchmarks and generate individual plots
+    print("Running MLA benchmark...")
+    benchmark_mla.run(print_data=False, show_plots=True, save_path=".")


Time permitting, it would be nice to compare against the existing FlashInfer RoPE followed by torch-native quantization as a baseline for these measurements.
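A minimal sketch of how such a baseline comparison could be structured: time both paths with the same harness and report the ratio. `baseline` and `fused` are placeholder callables (for example, the existing RoPE call followed by a torch-native scale-and-cast, and the fused op); the helper names are hypothetical. For GPU work, `sync` must block until queued kernels finish (e.g. `torch.cuda.synchronize`), and a CUDA-event timer such as `triton.testing.do_bench` is generally more accurate than wall-clock timing.

```python
import statistics
import time
from typing import Callable, Dict


def median_latency_ms(fn: Callable[[], object], warmup: int = 3, iters: int = 20,
                      sync: Callable[[], None] = lambda: None) -> float:
    """Median wall-clock latency of fn() in milliseconds.

    The default no-op `sync` is only correct for synchronous (CPU) work.
    """
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare(baseline: Callable[[], object], fused: Callable[[], object],
            **kwargs) -> Dict[str, float]:
    """Time both paths with identical settings and report the speedup of `fused`."""
    base_ms = median_latency_ms(baseline, **kwargs)
    fused_ms = median_latency_ms(fused, **kwargs)
    speedup = base_ms / fused_ms if fused_ms > 0 else float("inf")
    return {"baseline_ms": base_ms, "fused_ms": fused_ms, "speedup": speedup}
```

Using the median rather than the mean keeps one-off stalls from skewing sub-10-microsecond measurements like those in the summary table.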

pavanimajety · 2025-10-16T00:47:15Z

flashinfer/rope.py

    mutates_args=("q_rope_out", "k_rope_out", "q_nope_out", "k_nope_out"),
 )
-def _mla_rope_quantize(
+def _rope_quantize(


Nit: please add a docstring here as well.
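As an illustration of the kind of docstring being requested, here is a hedged pure-Python reference for what a RoPE + quantize op computes on one head of one token. The helper names are hypothetical, and several details are assumptions: cos/sin are computed on the fly with base 10000 (the real op reads them from `cos_sin_cache`), `quant_scale` is treated as a multiplier, and outputs are saturated to the FP8 E4M3 maximum of 448 without rounding to the FP8 grid.

```python
import math
from typing import List, Sequence, Tuple

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def rope_cos_sin(pos: int, rope_dim: int, base: float = 10000.0) -> Tuple[List[float], List[float]]:
    """cos/sin for one position using the standard inverse-frequency schedule (assumed)."""
    half = rope_dim // 2
    angles = [pos * base ** (-2.0 * i / rope_dim) for i in range(half)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def rope_quantize_ref(x_rope: Sequence[float], x_nope: Sequence[float], pos: int,
                      quant_scale: float, interleave: bool = False) -> Tuple[List[float], List[float]]:
    """Reference RoPE + scaled quantization for one head of one token.

    Args:
        x_rope: RoPE part of the head, length rope_dim (must be even).
        x_nope: no-RoPE part of the head; it is scaled but not rotated.
        pos: token position used to compute cos/sin.
        quant_scale: multiplier applied before saturating to the FP8 E4M3 range
            (e.g. quant_scale_q for Q heads, quant_scale_kv for K heads).
        interleave: if True, rotate pairs (x[2i], x[2i+1]) (GPT-J style);
            otherwise rotate (x[i], x[i + rope_dim // 2]) (NeoX style).

    Returns:
        (rope_out, nope_out) as Python floats, scaled and clamped to [-448, 448].
    """
    rope_dim = len(x_rope)
    half = rope_dim // 2
    cos, sin = rope_cos_sin(pos, rope_dim)
    out = [0.0] * rope_dim
    for i in range(half):
        a, b = (2 * i, 2 * i + 1) if interleave else (i, i + half)
        # Standard 2-D rotation of the pair (x[a], x[b]) by angle i.
        out[a] = x_rope[a] * cos[i] - x_rope[b] * sin[i]
        out[b] = x_rope[b] * cos[i] + x_rope[a] * sin[i]

    def quant(v: float) -> float:
        # Scale, then saturate; rounding to the FP8 grid is omitted.
        return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v * quant_scale))

    return [quant(v) for v in out], [quant(v) for v in x_nope]
```

At `pos=0` the rotation is the identity, so the op reduces to scale-and-saturate; this is a convenient first check when comparing a fused kernel against a reference.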

pavanimajety

LGTM, mostly just nits for documentation and benchmark. Thanks for the effort!

kahyunnam force-pushed the knam/RoPe-fusion branch from fa658e2 to bd2a338 October 14, 2025 00:58

kahyunnam added 4 commits October 13, 2025 21:01

rebase; merge conflict fix

ba59eb7

simply constexpr in MLARopeQuantize launcher using #define DISPATCH_ROPE_DIM

9e8f534

simplify testing

5278495

clean up + rename to remove 'mla'

0fc5d11

kahyunnam force-pushed the knam/RoPe-fusion branch from bd2a338 to 0fc5d11 October 14, 2025 04:05

rebase conflicts cont

88d13c5

kahyunnam changed the title from "[not ready for review! draft.] MLA RoPE + quantization kernel generalization for MHA / GQA" to "MLA RoPE + quantization fused kernel: shape generalization for MHA / GQA" Oct 14, 2025

fix clang format

f5dc938

kahyunnam marked this pull request as ready for review October 14, 2025 04:48

add more tests

4088ec8

kahyunnam enabled auto-merge (squash) October 14, 2025 17:37

kahyunnam requested a review from yzh119 October 14, 2025 17:40

pavanimajety reviewed Oct 14, 2025


csrc/rope.cu (resolved)

pavanimajety mentioned this pull request Oct 14, 2025

[Performance]: ROPE + KV-Cache-Write + pre-attn prepare-ops fusion vllm-project/vllm#24678 (open)

add data type checks

234e52e

pavanimajety reviewed Oct 16, 2025


pavanimajety approved these changes Oct 16, 2025



2 participants
