
@Hamiltonian-Action (Contributor) commented Aug 11, 2025

Well, this is a bit embarrassing: I had finished up work on canonicalize and canonicalize_rref on Friday but must have forgotten to push things out and set up this draft.
Lots of files have been modified in this branch, but most of the changes are merely structural and relate to code organisation; the meaningful changes are found in the following:

  • default_parameters.jl
  • enumerations.jl and fixed_sizes.jl
  • mutex_configuration.jl and mutex_management.jl
  • bit_manipulation.jl
  • scan_step.jl
  • snippets.jl (namely, the new snippets in the catalogue)
  • reductions.jl (namely, reduce_lexicographic_min!)
  • canonicalize.jl and canonicalize_rref.jl

I believe the implementation is reasonably performant and work-efficient, as well as being quite flexible with regard to potential extensions. However, it is also a sinful blight upon this cruel and uncaring universe due to necessitating coding around some terrible ergonomics and missing features. Test files, benchmarks, and an implementation of canonicalize_gott will follow in a suitable time frame.
Side note: the implementation provided can handle both left and right multiplication as well as X-first or Z-first ordering. Is there any consistent convention for what ought to be a ::T argument and what ought to be a ::Val{T} parameter in terms of the public-facing API? Internally they can be whatever the implementation wishes, but I would prefer giving this some structured thought so as to avoid having different functions accept different signatures -- e.g. the mul family takes phases::Val{B} whereas the canonicalize functions only accept phases::Bool.

  • The code is properly formatted and commented.
  • Substantial new functionality is documented within the docs.
  • All new functionality is tested.
  • All of the automated tests on GitHub pass.
  • We recently started enforcing formatting checks. If formatting issues are reported in the new code you have written, please correct them. There will be plenty of old code that is flagged as we are slowly transitioning to enforced formatting. Please do not worry about or address older formatting issues -- keep your PR just focused on your planned contribution.


codecov bot commented Aug 11, 2025

Codecov Report

❌ Patch coverage is 0% with 753 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.44%. Comparing base (da8a57d) to head (87c8e2f).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...xt/QuantumCliffordKAExt/canonicalization/common.jl | 0.00% | 214 Missing ⚠️ |
| ext/QuantumCliffordKAExt/mul/new_interface.jl | 0.00% | 104 Missing ⚠️ |
| ext/QuantumCliffordKAExt/utilities/snippets.jl | 0.00% | 104 Missing ⚠️ |
| ext/QuantumCliffordKAExt/mul/device_mul.jl | 0.00% | 62 Missing ⚠️ |
| ...liffordKAExt/canonicalization/canonicalize_rref.jl | 0.00% | 51 Missing ⚠️ |
| ...ntumCliffordKAExt/canonicalization/canonicalize.jl | 0.00% | 49 Missing ⚠️ |
| ext/QuantumCliffordKAExt/utilities/reductions.jl | 0.00% | 49 Missing ⚠️ |
| ...QuantumCliffordKAExt/utilities/mutex_management.jl | 0.00% | 39 Missing ⚠️ |
| ext/QuantumCliffordKAExt/mul/old_interface.jl | 0.00% | 36 Missing ⚠️ |
| ext/QuantumCliffordKAExt/utilities/scan_step.jl | 0.00% | 19 Missing ⚠️ |

... and 5 more

❗ There is a different number of reports uploaded between BASE (da8a57d) and HEAD (87c8e2f). Click for more details.

HEAD has 6 uploads less than BASE
| Flag | BASE (da8a57d) | HEAD (87c8e2f) |
| --- | --- | --- |
|  | 13 | 7 |
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #576       +/-   ##
===========================================
- Coverage   84.40%   70.44%   -13.96%     
===========================================
  Files         114      127       +13     
  Lines        6963     7675      +712     
===========================================
- Hits         5877     5407      -470     
- Misses       1086     2268     +1182     

☔ View full report in Codecov by Sentry.
@Krastanov (Member)

On a cursory skim through this, it looks very impressive. Thankfully, you have proven to be very attentive to details and edge cases -- it would be simply impossible for me to truly review the entirety of this code on any reasonable timescale, but I can rely on your track record, documentation, tests, and the fact that this is well compartmentalized within an extension.

To your question about ::Bool vs Val{true}/Val{false}: the use of Val types guarantees that any type-stable/inferred/grounded (not sure which one is the correct name) code will resolve any if/else branches at compile time. On the other hand, if I used Bool, I would need to rely on the constant-propagation heuristic of the compiler, which is not a hard guarantee of the language. In testing, using Bool led to poor performance in some inner loops, so the internal API uses Val. However, while Val has these nice semantic guarantees, it is a bit embarrassing to use it in user-facing APIs -- it is an implementation detail related to performance leaking into what should be a nice API. So any user-facing API has a Bool that then calls the internal Val methods.

As a rule of thumb, the Val methods exist for being called in tight loops, while the Bool methods exist to be a pleasant user-facing API.

To finally answer your question: if it is public API, use Bool. If Bool and Val compile and perform equally well, use Bool. Generally, avoid Val unless the compiler forces you to use it.
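
As a minimal sketch of this split (hypothetical names only, not the actual QuantumClifford.jl methods): the internal method takes a Val so the phase branch is resolved when the method is compiled, while the public entry point accepts a plain Bool and simply forwards to it.

```julia
# Sketch only; the names are illustrative, not the actual internals.
# Internal method: B is part of the type signature, so `if B` is resolved
# at compile time even inside the tight loop.
function _accumulate_rows!(dst::AbstractVector{UInt64}, src::AbstractVector{UInt64},
                           ::Val{B}) where {B}
    phase = 0
    @inbounds for i in eachindex(dst, src)
        if B
            phase += count_ones(dst[i] & src[i])  # stand-in for phase bookkeeping
        end
        dst[i] ⊻= src[i]
    end
    return B ? (dst, phase & 3) : dst
end

# Public-facing method: a pleasant Bool keyword that forwards to the Val methods.
accumulate_rows!(dst, src; phases::Bool=true) =
    phases ? _accumulate_rows!(dst, src, Val(true)) :
             _accumulate_rows!(dst, src, Val(false))
```

Each internal method is compiled with its branch already resolved; the wrapper only pays a single runtime branch at the boundary.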

It is quite possible that many of the Val methods currently in the codebase are not even necessary anymore. It is also possible that back in the Julia 1.7 era there was already a better way to structure the code without needing the Val methods to begin with, but I just could not think of it.

@Krastanov (Member)

Not for this PR: I am wondering whether we should run the benchmarks as part of GPU CI, just to make sure they never bitrot? Currently nothing is running them. If we end up doing this, no need to even report their results (that can be a separate task, figuring out a good way for CI to post comments with benchmark results and so on). How slow are the benchmarks right now?

Comment on lines +14 to +21
# WARNING: Atomix has an API for atomic load/store, yet support is missing.
# WARNING: KernelAbstractions does not provide any sort of memory fence.

#==============================================================================
SINFUL IMPLEMENTATION WROUGHT ABOUT BY THE FOLLY OF MANKIND
==============================================================================#

# TODO: Overhaul this entirely once support becomes available.
@Krastanov (Member)
Once this is merged, could you post an issue to our issue tracker explaining that this implementation can potentially be drastically simplified if Atomix/KernelAbstractions gain certain features (could you enumerate them?). If there is a corresponding issue on the Atomix and KernelAbstractions issue trackers, could you link to it? If there is no such issue on their issue trackers, could you create one?

Lastly, in that same issue, could you give a simple example of the type of bugs that canonicalize might have if these fences were not used?

@Krastanov (Member)

Would the implementation of these functions be simpler if you are permitted to assume the following: The memory behind the array behind the tableau is NOT pointed to by any other handle. I.e. there are never any subtableaux, there are never any views, there is no aliasing.

Of course, that cannot be guaranteed, but if the code would be simpler, it makes a lot of sense to just say "undefined behavior occurs when canonicalizations are called on aliased data".

@Hamiltonian-Action (Contributor, Author) commented Aug 11, 2025

> To your question about ::Bool vs Val{true}/Val{false}: the use of Val types guarantees that any type-stable/inferred/grounded (not sure which one is the correct name) code will resolve any if/else branches at compile time. On the other hand, if I used Bool, I would need to rely on the constant-propagation heuristic of the compiler, which is not a hard guarantee of the language. In testing, using Bool led to poor performance in some inner loops, so the internal API uses Val. However, while Val has these nice semantic guarantees, it is a bit embarrassing to use it in user-facing APIs -- it is an implementation detail related to performance leaking into what should be a nice API. So any user-facing API has a Bool that then calls the internal Val methods.

I am aware of the fact that Val propagates as a compile-time constant, which allows generating different code paths within the same function [1], and I have (ab)used it extensively; the dynamic tuning parameters would not be possible without it. The question is whether mul and its peers are artifacts of their time and it would be preferable to shift towards exposing an API with concrete types that are then transformed internally into Val parameters for further processing -- akin to what is being done with @valbooldispatch. This would increase clarity regarding what the permissible values are without making any meaningful sacrifices, while also allowing flexibility regarding whether an entirely new compilation step is required for each invocation.

[1] Strictly speaking, multiple instantiations of the same function.
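
For concreteness, a hedged sketch (hypothetical kernel and names, not code from this PR) of the kind of Val-typed tuning parameter referred to above; lifting a concrete integer into a Val at the API boundary means each chosen width gets its own specialised method, at the cost of a possible fresh compilation per new value.

```julia
# Hypothetical sketch of a Val-typed tuning parameter; not from this PR.
# W is a type parameter, so the inner loop bounds are compile-time constants
# and each width compiles to its own specialised method.
function strided_xor!(dst::AbstractVector{UInt64}, src::AbstractVector{UInt64},
                      ::Val{W}) where {W}
    i = firstindex(dst)
    @inbounds while i + W - 1 <= lastindex(dst)
        for k in 0:W-1
            dst[i + k] ⊻= src[i + k]
        end
        i += W
    end
    @inbounds for j in i:lastindex(dst)  # remainder elements
        dst[j] ⊻= src[j]
    end
    return dst
end

# Concrete-typed entry point: lifting the runtime Int into a Val here is what
# may trigger a new compilation step for each previously unseen width.
strided_xor!(dst, src; width::Int = 4) = strided_xor!(dst, src, Val(width))
```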

> Not for this PR: I am wondering whether we should run the benchmarks as part of GPU CI, just to make sure they never bitrot? Currently nothing is running them. If we end up doing this, no need to even report their results (that can be a separate task, figuring out a good way for CI to post comments with benchmark results and so on). How slow are the benchmarks right now?

As slow or as fast as desired. The benchmark sizes, sample count, and iteration time limit are all exposed via constants defined in a single file.

> Would the implementation of these functions be simpler if you are permitted to assume the following: The memory behind the array behind the tableau is NOT pointed to by any other handle. I.e. there are never any subtableaux, there are never any views, there is no aliasing.

Not really. The shenanigans are required because the blocks must cooperatively decide which pauli operator at which row ought to be next, yet there are far too many bits in the transaction for a single atomic to handle [2]; moreover, multiple blocks can potentially operate on the same row (depending on the batch size), so they need to make sure that they do not impact each other's work. There is an alternative work-around that does not require atomics but would necessitate allocating around 16 bytes of auxiliary memory per thread block; from the previous work on multiplication, you seemed to indicate it would be preferable to avoid doing so.

[2] Not entirely true. Using bit manipulation, it is possible to pack all the data into 16 bytes and then utilise a single atomic min. To the best of my knowledge, only NVIDIA supports transactions of this size, so such an implementation would not be portable. As an extra titbit that I found out whilst untangling the mess caused by OpenCL, they (JuliaGPU) only implemented support for 8-byte atomics just last week!
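
Purely for illustration, a sketch of the kind of packing alluded to in [2], restricted to 8 bytes so that a portable 64-bit atomic min would suffice (hypothetical field widths and selection criterion, not the layout used here): the selection key occupies the high bits and the row index the low bits, so the minimum of the packed words selects both at once.

```julia
# Hypothetical 8-byte packing; the real selection criterion and widths differ.
# Smaller key = preferred row; ties are broken towards the smaller row index.
pack(key::UInt32, row::UInt32)::UInt64 = (UInt64(key) << 32) | UInt64(row)
unpack(word::UInt64) = (key = UInt32(word >> 32), row = UInt32(word & 0xffffffff))

# Serial stand-in for the cooperative step: on the device, each block would
# combine its candidate into a shared slot with an atomic min (or a CAS loop).
function select_row(keys::AbstractVector{UInt32})
    best = typemax(UInt64)
    for (row, key) in pairs(keys)
        best = min(best, pack(key, UInt32(row)))
    end
    return unpack(best)
end
```

For example, `select_row(UInt32[7, 3, 3, 9])` returns `(key = 0x00000003, row = 0x00000002)`, i.e. the smallest key with ties resolved to the earlier row.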

@Hamiltonian-Action (Contributor, Author) commented Aug 12, 2025

> There is an alternative work-around that does not require atomics but would necessitate allocating around 16 bytes of auxiliary memory per thread block; from the previous work on multiplication, you seemed to indicate it would be preferable to avoid doing so.

Pondering this further, there is a mutex-free implementation using 8 bytes per row (rather than per thread block) plus a single atomic min. It would also reduce redundant work and bandwidth consumption substantially, only requiring a new pass whenever multiplication is conducted.

@Hamiltonian-Action (Contributor, Author)

@Krastanov, any thoughts on the above proposition of 8 bytes per row as scratch space? It would simplify the implementation a ton (something you would appreciate) and likely increase performance by a considerable amount (something I would appreciate and benchmark). Or is minimising memory consumption still of utmost importance?

@Krastanov (Member)

I like your suggestion. Memory consumption is probably the least important constraint in how this library is usually used.

@Hamiltonian-Action (Contributor, Author)

  1. Test suite and benchmark scripts are all done but are getting more and more hideous with each incremental update. I will open another PR later that introduces an Adapt.jl extension and a few quality-of-life changes that make them more palatable. Furthermore, whilst I am not fundamentally opposed to the concept of a garbage-collected language, Julia and JuliaGPU make maintaining a sensible memory footprint quite a pain since the latter appears opaque to the former.

  2. The major building blocks for canonicalize_gott are already mostly present in the existing mul_and_scan! kernel, but the qubit shuffling is proving to be tricky since it has a non-negligible sequential element. I have a few ideas on how to go about this with a meet-in-the-middle approach, but it would require allocating a tableau of equivalent size for use as intermediate storage, since there is no performant way to re-utilise space due to the lack of runtime concurrency information -- something I have whined about, and will continue to whine about, all over the JuliaGPU space.

@Krastanov (Member)

Great news!

> Julia and JuliaGPU make maintaining a sensible memory footprint quite a pain since the latter appears opaque to the former

Yeah :) If you have the time to make small examples of issues with very succinctly described rationale (it is really important to be succinct and trivially easy to understand in these), the JuliaGPU maintainers would certainly appreciate such issue reports (perhaps more as aspirational "programming language theory" feature requests).

More practically, in the context of this library: do not hesitate to pay a 20% performance penalty for code that is half the length (and much simpler) compared to the more performant version, especially if this will save you time. Simplicity (with documented issues and comments explaining that there might be a faster method) is of great value.

@Hamiltonian-Action force-pushed the KernelAbstractions_canonicalization branch from ce32a1c to e19e58a on August 22, 2025 at 16:33
@Hamiltonian-Action (Contributor, Author) commented Aug 22, 2025

Good news: The benchmark suite is now much more robust and dynamic in how it tackles long-running tasks. There is now a threshold beyond which it will extrapolate from previous data rather than actually conduct any further runs, as well as a simple toggle to just assume O(n^k) scaling if there are insufficient previous data points for a genuine fit.
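
A rough sketch of the extrapolation idea (illustrative only; the thresholds and fitting in the actual benchmark scripts may differ): fit the exponent k on a log-log scale from the sizes already measured and predict the next runtime, falling back to an assumed O(n^k) when too few points exist.

```julia
# Illustrative power-law extrapolation for long-running benchmark sizes;
# not the code in this PR. Fits t ≈ c * n^k by least squares in log-log space.
function fit_power_law(ns::AbstractVector, ts::AbstractVector)
    x = log.(float.(ns)); y = log.(float.(ts))
    xm, ym = sum(x) / length(x), sum(y) / length(y)
    k = sum((x .- xm) .* (y .- ym)) / sum((x .- xm) .^ 2)
    return exp(ym - k * xm), k
end

# Predict the runtime at the next size instead of measuring it, falling back
# to an assumed exponent when there are too few previous data points.
function extrapolate_runtime(ns, ts, n_next; assumed_k = 2.0)
    length(ns) < 3 && return ts[end] * (n_next / ns[end])^assumed_k
    c, k = fit_power_law(ns, ts)
    return c * n_next^k
end
```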

Bad news: I just realised a major oversight in the implemented calls, since my struggles with the JuliaGPU ecosystem made me forget that (Mixed)Destabilizers exist: the present routines will only modify the Stabilizer section and completely disregard updating the accompanying Destabilizer generators.

Good news: The solution is simply one divide-and-conquer xor reduction before each mul_and_scan! kernel call.
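
For reference, a hedged, host-side sketch of a divide-and-conquer xor reduction (the actual version would be a device kernel over bit-packed tableau rows; only the shape of the reduction tree carries over):

```julia
# Host-side illustration of a tree-style xor reduction; the real routine is a
# device kernel over packed rows, so only the divide-and-conquer shape applies.
function xor_reduce(rows::Vector{UInt64})
    isempty(rows) && return zero(UInt64)
    buf = copy(rows)
    n = length(buf)
    stride = 1
    while stride < n
        for i in 1:2stride:n          # combine pairs that are `stride` apart
            j = i + stride
            j <= n && (buf[i] ⊻= buf[j])
        end
        stride *= 2
    end
    return buf[1]
end
```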

Bad news: I will be travelling this week and may not find the time to implement said workaround.

@Krastanov (Member)

Thanks for the update. This sounds great! The schedule you are suggesting is perfectly reasonable, no worries about that.

@Hamiltonian-Action (Contributor, Author)

Apologies for the prolonged absence; a lot has been going on this past month between the nightmare of hunting for suitable accommodation and the Sisyphean climb against administrative bureaucracy. Hopefully the situation will clear up come early October so I can finally get some clearance to continue working on this draft.
Please spare a moment of silence for the shattered remains of the projected timeline.

@Krastanov (Member)

No worries, @Hamiltonian-Action, your contributions are very valuable. Let me know if there is anything I can help with.
