
@Hamiltonian-Action (Contributor) commented Aug 11, 2025

Well, this is a bit embarrassing: I had finished up work on canonicalize and canonicalize_rref on Friday but must have forgotten to push things out and set up this draft.
Lots of files have been modified in this branch, but most of the changes are merely structural and relate to code organisation; the meaningful changes are found in the following:

  • default_parameters.jl
  • enumerations.jl and fixed_sizes.jl
  • mutex_configuration.jl and mutex_management.jl
  • bit_manipulation.jl
  • scan_step.jl
  • snippets.jl (namely, the new snippets in the catalogue)
  • reductions.jl (namely, reduce_lexicographic_min!)
  • canonicalize.jl and canonicalize_rref.jl

I believe the implementation is reasonably performant and work-efficient, as well as being quite flexible with regard to potential extensions. However, it is also a sinful blight upon this cruel and uncaring universe due to necessitating coding around some terrible ergonomics and missing features. Test files, benchmarks, and an implementation of canonicalize_gott will follow in a suitable time frame.
Side note: the implementation provided can handle both left and right multiplication as well as X-first or Z-first ordering. Is there any consistent convention for what ought to be a ::T argument and what ought to be a ::Val{T} parameter in terms of the public-facing API? Internally they can be whatever the implementation wishes, but I would prefer giving this some structured thought so as to avoid having different functions accept different signatures -- e.g. the mul family takes phases::Val{B} whereas the canonicalize functions only accept phases::Bool.

  • The code is properly formatted and commented.
  • Substantial new functionality is documented within the docs.
  • All new functionality is tested.
  • All of the automated tests on GitHub pass.
  • We recently started enforcing formatting checks. If formatting issues are reported in the new code you have written, please correct them. There will be plenty of old code that is flagged as we are slowly transitioning to enforced formatting. Please do not worry about or address older formatting issues -- keep your PR just focused on your planned contribution.


codecov bot commented Aug 11, 2025

Codecov Report

❌ Patch coverage is 0% with 753 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.44%. Comparing base (da8a57d) to head (87c8e2f).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...xt/QuantumCliffordKAExt/canonicalization/common.jl | 0.00% | 214 Missing ⚠️ |
| ext/QuantumCliffordKAExt/mul/new_interface.jl | 0.00% | 104 Missing ⚠️ |
| ext/QuantumCliffordKAExt/utilities/snippets.jl | 0.00% | 104 Missing ⚠️ |
| ext/QuantumCliffordKAExt/mul/device_mul.jl | 0.00% | 62 Missing ⚠️ |
| ...liffordKAExt/canonicalization/canonicalize_rref.jl | 0.00% | 51 Missing ⚠️ |
| ...ntumCliffordKAExt/canonicalization/canonicalize.jl | 0.00% | 49 Missing ⚠️ |
| ext/QuantumCliffordKAExt/utilities/reductions.jl | 0.00% | 49 Missing ⚠️ |
| ...QuantumCliffordKAExt/utilities/mutex_management.jl | 0.00% | 39 Missing ⚠️ |
| ext/QuantumCliffordKAExt/mul/old_interface.jl | 0.00% | 36 Missing ⚠️ |
| ext/QuantumCliffordKAExt/utilities/scan_step.jl | 0.00% | 19 Missing ⚠️ |

... and 5 more

❗ There is a different number of reports uploaded between BASE (da8a57d) and HEAD (87c8e2f). Click for more details.

HEAD has 6 uploads less than BASE
| Flag | BASE (da8a57d) | HEAD (87c8e2f) |
| --- | --- | --- |
|  | 13 | 7 |
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #576       +/-   ##
===========================================
- Coverage   84.40%   70.44%   -13.96%     
===========================================
  Files         114      127       +13     
  Lines        6963     7675      +712     
===========================================
- Hits         5877     5407      -470     
- Misses       1086     2268     +1182     

☔ View full report in Codecov by Sentry.
@Krastanov (Member)

On a cursory skim through this, it looks very impressive. Thankfully, you have proven to be very attentive to details and edge cases -- it would be simply impossible for me to truly review the entirety of this code on any reasonable timescale, but I can rely on your track record, documentation, tests, and the fact that this is well compartmentalized within an extension.

To your question about ::Bool vs Val{true}/Val{false}: the use of Val types guarantees that any type-stable/inferred/grounded (not sure which one is the correct name) code will resolve any if/else branches at compile time. On the other hand, if I used Bool, I would need to rely on the constant-propagation heuristic of the compiler, which is not a hard guarantee of the language. In testing, using Bool led to poor performance in some inner loops, so the internal API uses Val. However, while Val has these nice semantic guarantees, it is a bit embarrassing to use it in user-facing APIs -- it is an implementation detail related to performance leaking into what should be a nice API. So any user-facing API has a Bool that then calls the internal Val methods.

As a rule of thumb, the Val methods exist for being called in tight loops, while the Bool methods exist to be a pleasant user-facing API.

To finally answer your question: if it is public API, use Bool. If Bool and Val compile and perform equally well, use Bool. Generally, avoid Val unless the compiler forces you to use it.
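
As a minimal sketch of this split (hypothetical names only, not the actual QuantumClifford.jl methods): the internal method takes a Val so the phase branch is resolved when the method is compiled, while the public entry point accepts a plain Bool and simply forwards to it.

```julia
# Sketch only; the names are illustrative, not the actual internals.
# Internal method: B is part of the type signature, so `if B` is resolved
# at compile time even inside the tight loop.
function _accumulate_rows!(dst::AbstractVector{UInt64}, src::AbstractVector{UInt64},
                           ::Val{B}) where {B}
    phase = 0
    @inbounds for i in eachindex(dst, src)
        if B
            phase += count_ones(dst[i] & src[i])  # stand-in for phase bookkeeping
        end
        dst[i] ⊻= src[i]
    end
    return B ? (dst, phase & 3) : dst
end

# Public-facing method: a pleasant Bool keyword that forwards to the Val methods.
accumulate_rows!(dst, src; phases::Bool=true) =
    phases ? _accumulate_rows!(dst, src, Val(true)) :
             _accumulate_rows!(dst, src, Val(false))
```

Each internal method is compiled with its branch already resolved; the wrapper only pays a single runtime branch at the boundary.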

It is quite possible that many of the Val methods currently in the codebase are not even necessary anymore. It is also possible that back in the Julia 1.7 era there was already a better way to structure the code without needing the Val methods to begin with, but I just could not think of it.

@Krastanov (Member)

Not for this PR: I am wondering whether we should run the benchmarks as part of GPU CI, just to make sure they never bitrot? Currently nothing is running them. If we end up doing this, no need to even report their results (that can be a separate task, figuring out a good way for CI to post comments with benchmark results and so on). How slow are the benchmarks right now?

Comment on lines +14 to +21
# WARNING: Atomix has an API for atomic load/store, yet support is missing.
# WARNING: KernelAbstractions does not provide any sort of memory fence.

#==============================================================================
SINFUL IMPLEMENTATION WROUGHT ABOUT BY THE FOLLY OF MANKIND
==============================================================================#

# TODO: Overhaul this entirely once support becomes available.
@Krastanov (Member)
Once this is merged, could you post an issue to our issue tracker explaining that this implementation can potentially be drastically simplified if Atomix/KernelAbstractions gain certain features (could you enumerate them?). If there is a corresponding issue on the Atomix and KernelAbstractions issue trackers, could you link to it? If there is no such issue on their issue trackers, could you create one?

Lastly, in that same issue, could you give a simple example of the type of bugs that canonicalize might have if these fences were not used?

@Krastanov (Member)

Would the implementation of these functions be simpler if you are permitted to assume the following: The memory behind the array behind the tableau is NOT pointed to by any other handle. I.e. there are never any subtableaux, there are never any views, there is no aliasing.

Of course, that cannot be guaranteed, but if the code would be simpler, it makes a lot of sense to just say "undefined behavior occurs when canonicalizations are called on aliased data".

@Hamiltonian-Action (Contributor, Author) commented Aug 11, 2025

> To your question about ::Bool vs Val{true}/Val{false}: the use of Val types guarantees that any type-stable/inferred/grounded (not sure which one is the correct name) code will resolve any if/else branches at compile time. On the other hand, if I used Bool, I would need to rely on the constant-propagation heuristic of the compiler, which is not a hard guarantee of the language. In testing, using Bool led to poor performance in some inner loops, so the internal API uses Val. However, while Val has these nice semantic guarantees, it is a bit embarrassing to use it in user-facing APIs -- it is an implementation detail related to performance leaking into what should be a nice API. So any user-facing API has a Bool that then calls the internal Val methods.

I am aware of the fact that Val propagates as a compile-time constant, which allows generating different code paths within the same function [1], and I have (ab)used it extensively; the dynamic tuning parameters would not be possible without it. The question is whether mul and its peers are artifacts of their time and it would be preferable to shift towards exposing an API with concrete types that are then transformed internally into Val parameters for further processing -- akin to what is being done with @valbooldispatch. This would increase clarity regarding what the permissible values are without making any meaningful sacrifices, while also allowing flexibility regarding whether an entirely new compilation step is required for each invocation.

[1] Strictly speaking, multiple instantiations of the same function.
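
For concreteness, a hedged sketch (hypothetical kernel and names, not code from this PR) of the kind of Val-typed tuning parameter referred to above; lifting a concrete integer into a Val at the API boundary means each chosen width gets its own specialised method, at the cost of a possible fresh compilation per new value.

```julia
# Hypothetical sketch of a Val-typed tuning parameter; not from this PR.
# W is a type parameter, so the inner loop bounds are compile-time constants
# and each width compiles to its own specialised method.
function strided_xor!(dst::AbstractVector{UInt64}, src::AbstractVector{UInt64},
                      ::Val{W}) where {W}
    i = firstindex(dst)
    @inbounds while i + W - 1 <= lastindex(dst)
        for k in 0:W-1
            dst[i + k] ⊻= src[i + k]
        end
        i += W
    end
    @inbounds for j in i:lastindex(dst)  # remainder elements
        dst[j] ⊻= src[j]
    end
    return dst
end

# Concrete-typed entry point: lifting the runtime Int into a Val here is what
# may trigger a new compilation step for each previously unseen width.
strided_xor!(dst, src; width::Int = 4) = strided_xor!(dst, src, Val(width))
```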

> Not for this PR: I am wondering whether we should run the benchmarks as part of GPU CI, just to make sure they never bitrot? Currently nothing is running them. If we end up doing this, no need to even report their results (that can be a separate task, figuring out a good way for CI to post comments with benchmark results and so on). How slow are the benchmarks right now?

As slow or as fast as desired. The benchmark sizes, sample count, and iteration time limit are all exposed via constants defined in a single file.

> Would the implementation of these functions be simpler if you are permitted to assume the following: The memory behind the array behind the tableau is NOT pointed to by any other handle. I.e. there are never any subtableaux, there are never any views, there is no aliasing.

Not really. The shenanigans are required because the blocks must cooperatively decide which pauli operator at which row ought to be next, yet there are far too many bits in the transaction for a single atomic to handle [2]; moreover, multiple blocks can potentially operate on the same row (depending on the batch size), so they need to make sure that they do not impact each other's work. There is an alternative work-around that does not require atomics but would necessitate allocating around 16 bytes of auxiliary memory per thread block; from the previous work on multiplication, you seemed to indicate it would be preferable to avoid doing so.

[2] Not entirely true. Using bit manipulation, it is possible to pack all the data into 16 bytes and then utilise a single atomic min. To the best of my knowledge, only NVIDIA supports transactions of this size, so such an implementation would not be portable. As an extra titbit that I found out whilst untangling the mess caused by OpenCL, they (JuliaGPU) only implemented support for 8-byte atomics just last week!
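
Purely for illustration, a sketch of the kind of packing alluded to in [2], restricted to 8 bytes so that a portable 64-bit atomic min would suffice (hypothetical field widths and selection criterion, not the layout used here): the selection key occupies the high bits and the row index the low bits, so the minimum of the packed words selects both at once.

```julia
# Hypothetical 8-byte packing; the real selection criterion and widths differ.
# Smaller key = preferred row; ties are broken towards the smaller row index.
pack(key::UInt32, row::UInt32)::UInt64 = (UInt64(key) << 32) | UInt64(row)
unpack(word::UInt64) = (key = UInt32(word >> 32), row = UInt32(word & 0xffffffff))

# Serial stand-in for the cooperative step: on the device, each block would
# combine its candidate into a shared slot with an atomic min (or a CAS loop).
function select_row(keys::AbstractVector{UInt32})
    best = typemax(UInt64)
    for (row, key) in pairs(keys)
        best = min(best, pack(key, UInt32(row)))
    end
    return unpack(best)
end
```

For example, `select_row(UInt32[7, 3, 3, 9])` returns `(key = 0x00000003, row = 0x00000002)`, i.e. the smallest key with ties resolved to the earlier row.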

@Hamiltonian-Action (Contributor, Author) commented Aug 12, 2025

> There is an alternative work-around that does not require atomics but would necessitate allocating around 16 bytes of auxiliary memory per thread block; from the previous work on multiplication, you seemed to indicate it would be preferable to avoid doing so.

Pondering this further, there is a mutex-free implementation using 8 bytes per row (rather than per thread block) plus a single atomic min. It would also reduce redundant work and bandwidth consumption substantially, only requiring a new pass whenever multiplication is conducted.

@Hamiltonian-Action (Contributor, Author)

@Krastanov, any thoughts on the above proposition of 8 bytes per row as scratch space? It would simplify the implementation a ton (something you would appreciate) and likely increase performance by a considerable amount (something I would appreciate and benchmark). Or is minimising memory consumption still of utmost importance?

@Krastanov (Member)

I like your suggestion. Memory consumption is probably the least important constraint in how this library is usually used.

@Hamiltonian-Action (Contributor, Author)

  1. Test suite and benchmark scripts are all done but are getting more and more hideous with each incremental update. I will open another PR later that introduces an Adapt.jl extension and a few quality-of-life changes that make them more palatable. Furthermore, whilst I am not fundamentally opposed to the concept of a garbage-collected language, Julia and JuliaGPU make maintaining a sensible memory footprint quite a pain since the latter appears opaque to the former.

  2. The major building blocks for canonicalize_gott are already mostly present in the existing mul_and_scan! kernel, but the qubit shuffling is proving to be tricky since it has a non-negligible sequential element. I have a few ideas on how to go about this with a meet-in-the-middle approach, but it would require allocating a tableau of equivalent size for use as intermediate storage, since there is no performant way to re-utilise space due to the lack of runtime concurrency information -- something I have whined about, and will continue to whine about, all over the JuliaGPU space.

@Krastanov (Member)

Great news!

> Julia and JuliaGPU make maintaining a sensible memory footprint quite a pain since the latter appears opaque to the former

Yeah :) If you have the time to make small examples of issues with very succinctly described rationale (it is really important to be succinct and trivially easy to understand in these), the JuliaGPU maintainers would certainly appreciate such issue reports (perhaps more as aspirational "programming language theory" feature requests).

More practically, in the context of this library: do not hesitate to pay a 20% performance penalty for code that is half the length (and much simpler) compared to the more performant version, especially if this will save you time. Simplicity (with documented issues and comments explaining that there might be a faster method) is of great value.

@Hamiltonian-Action force-pushed the KernelAbstractions_canonicalization branch from ce32a1c to e19e58a on August 22, 2025 at 16:33
@Hamiltonian-Action (Contributor, Author) commented Aug 22, 2025

Good news: The benchmark suite is now much more robust and dynamic in how it tackles long-running tasks. There is now a threshold beyond which it will extrapolate from previous data rather than actually conduct any further runs, as well as a simple toggle to just assume O(n^k) scaling if there are insufficient previous data points for a genuine fit.
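
A rough sketch of the extrapolation idea (illustrative only; the thresholds and fitting in the actual benchmark scripts may differ): fit the exponent k on a log-log scale from the sizes already measured and predict the next runtime, falling back to an assumed O(n^k) when too few points exist.

```julia
# Illustrative power-law extrapolation for long-running benchmark sizes;
# not the code in this PR. Fits t ≈ c * n^k by least squares in log-log space.
function fit_power_law(ns::AbstractVector, ts::AbstractVector)
    x = log.(float.(ns)); y = log.(float.(ts))
    xm, ym = sum(x) / length(x), sum(y) / length(y)
    k = sum((x .- xm) .* (y .- ym)) / sum((x .- xm) .^ 2)
    return exp(ym - k * xm), k
end

# Predict the runtime at the next size instead of measuring it, falling back
# to an assumed exponent when there are too few previous data points.
function extrapolate_runtime(ns, ts, n_next; assumed_k = 2.0)
    length(ns) < 3 && return ts[end] * (n_next / ns[end])^assumed_k
    c, k = fit_power_law(ns, ts)
    return c * n_next^k
end
```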

Bad news: I just realised a major oversight in the implemented calls, since my struggles with the JuliaGPU ecosystem made me forget that (Mixed)Destabilizers exist: the present routines will only modify the Stabilizer section and completely disregard updating the accompanying Destabilizer generators.

Good news: The solution is simply one divide-and-conquer xor reduction before each mul_and_scan! kernel call.
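
For reference, a hedged, host-side sketch of a divide-and-conquer xor reduction (the actual version would be a device kernel over bit-packed tableau rows; only the shape of the reduction tree carries over):

```julia
# Host-side illustration of a tree-style xor reduction; the real routine is a
# device kernel over packed rows, so only the divide-and-conquer shape applies.
function xor_reduce(rows::Vector{UInt64})
    isempty(rows) && return zero(UInt64)
    buf = copy(rows)
    n = length(buf)
    stride = 1
    while stride < n
        for i in 1:2stride:n          # combine pairs that are `stride` apart
            j = i + stride
            j <= n && (buf[i] ⊻= buf[j])
        end
        stride *= 2
    end
    return buf[1]
end
```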

Bad news: I will be travelling this week and may not find the time to implement said workaround.

@Krastanov (Member)

Thanks for the update. This sounds great! The schedule you are suggesting is perfectly reasonable, no worries about that.

@Hamiltonian-Action (Contributor, Author)

Apologies for the prolonged absence; a lot has been going on this past month between the nightmare of hunting for suitable accommodation and the Sisyphean climb against administrative bureaucracy. Hopefully the situation will clear up come early October so I can finally get some clearance to continue working on this draft.
Please spare a moment of silence for the shattered remains of the projected timeline.

@Krastanov (Member)

No worries, @Hamiltonian-Action, your contributions are very valuable. Let me know if there is anything I can help with.
