Implement KernelAbstractions accelerated canonicalization routines. #576
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #576       +/-   ##
===========================================
- Coverage   84.40%   70.44%   -13.96%
===========================================
  Files         114      127       +13
  Lines        6963     7675      +712
===========================================
- Hits         5877     5407      -470
- Misses       1086     2268     +1182
===========================================
```
On a cursory skim through this, it looks very impressive. Thankfully you have proven to be very attentive to details and edge cases -- it would be simply impossible for me to truly review the entirety of this code on any reasonable timescale, but thankfully I can rely on your track record, documentation, tests, and the fact that this is well compartmentalized within an extension.

To your question about `::Bool` vs `Val{true}`/`Val{false}`: as a rule of thumb, the `Val` methods exist to be called in tight loops, while the `Bool` methods exist to be a pleasant user-facing API. To finally answer your question: if it is public API, use `Bool`. If `Bool` and `Val` compile and perform equally well, use `Bool`. Generally avoid `Val` unless the compiler forces you to. It is quite possible that many of the `Val` methods currently in the codebase are not even necessary anymore. It is also possible that back in the Julia 1.7 era there was already a better way to structure the code without needing the `Val` methods to begin with, but I just could not think of it.
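To illustrate the pattern being described, here is a hedged sketch (the name `do_mul!` is hypothetical, not from this codebase): a public method accepts a plain `Bool` and immediately forwards to `Val`-dispatched methods that sit in the hot path:

```julia
# Hypothetical names for illustration only.
# Public-facing API: a plain Bool keyword is pleasant to call.
do_mul!(state::Vector{UInt64}; phases::Bool=true) = do_mul!(state, Val(phases))

# Internal hot-path methods: dispatching on Val{true}/Val{false} lets the
# compiler compile each branch away inside tight loops.
function do_mul!(state::Vector{UInt64}, ::Val{true})
    # ... phase-tracking variant ...
    return state
end

function do_mul!(state::Vector{UInt64}, ::Val{false})
    # ... phase-free variant ...
    return state
end
```

Note that calling `Val(phases)` on a runtime `Bool` incurs one dynamic dispatch at the API boundary, which is exactly why the `Val` form is reserved for inner loops while the `Bool` form serves as the user-facing entry point.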
Not for this PR: I am wondering whether we should run the benchmarks as part of GPU CI, just to make sure they never bitrot? Currently nothing is running them. If we end up doing this, there is no need to even report their results (figuring out a good way for CI to post comments with benchmark results can be a separate task). How slow are the benchmarks right now?
```julia
# WARNING: Atomix has an API for atomic load/store, yet support is missing.
# WARNING: KernelAbstractions does not provide any sort of memory fence.

#==============================================================================
    SINFUL IMPLEMENTATION WROUGHT ABOUT BY THE FOLLY OF MANKIND
==============================================================================#

# TODO: Overhaul this entirely once support becomes available.
```
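For context on the first warning, one workaround sometimes used until proper load/store support lands is to emulate an atomic load with a fetch-add of zero and an atomic store with an atomic swap. This is a hedged sketch, not code from this PR; it assumes `Atomix.@atomic` and `Atomix.@atomicswap` behave on the target backend as they do on CPU arrays:

```julia
using Atomix

# Emulated atomic load: atomically adding zero returns the element's value
# without changing it.
atomic_load(a, i) = Atomix.@atomic a[i] += zero(eltype(a))

# Emulated atomic store: an atomic swap writes the new value; the displaced
# old value is simply discarded.
atomic_store!(a, i, v) = (Atomix.@atomicswap a[i] = v; nothing)

buf = zeros(UInt64, 4)
atomic_store!(buf, 2, UInt64(42))
x = atomic_load(buf, 2)   # x is now 42
```

Both emulations are read-modify-write operations, so they are typically more expensive than native loads and stores would be, which is part of why native support is worth requesting upstream.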
Once this is merged, could you post an issue to our issue tracker explaining that this implementation can potentially be drastically simplified if Atomix/KernelAbstractions gain certain features (could you enumerate them?). If there is a corresponding issue on the Atomix and KernelAbstractions issue trackers, could you link to it? If there is no such issue on their issue trackers, could you create one?

Lastly, in that same issue, could you give a simple example of the type of bugs that `canonicalize` might have if these fences were not used?
Would the implementation of these functions be simpler if you are permitted to assume the following: the memory behind the array behind the tableau is NOT pointed to by any other handle, i.e. there are never any subtableaux, there are never any views, there is no aliasing. Of course, that cannot be guaranteed, but if the code would be simpler it makes a lot of sense to just say "undefined behavior occurs when canonicalizations are called on aliased data".
I am aware of the fact that [1]

[1] Strictly speaking, multiple instantiations of the same function.
As slow or as fast as desired. The benchmark sizes, sample count, and iteration time limit are all exposed via constants defined in a single file.
Not really. The shenanigans are required because it is necessary to cooperatively decide which Pauli operator at which row ought to be next, but there are far too many bits in the transaction for a single atomic to handle [2]. Moreover, multiple blocks can potentially operate on the same row (depending on the batch size), so they need to make sure that they do not impact each other's work. There is an alternative workaround that does not require atomics but would necessitate allocating around

[2] Not entirely true. Using bit manipulation, it is possible to pack all the data into
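The bit-manipulation trick alluded to in footnote [2] can be sketched as follows. This is a hedged illustration under assumed field widths; `pack`/`unpack` are hypothetical helpers, not code from this PR. The idea is to squeeze the pivot column and candidate row into a single `UInt64` so that one atomic min over the packed key selects the lexicographically smallest candidate:

```julia
# Hypothetical packing scheme: high 32 bits hold the pivot column, low 32 bits
# hold the row index, so an unsigned compare on the packed word orders
# candidates first by column and then by row.
pack(col::UInt32, row::UInt32) = (UInt64(col) << 32) | UInt64(row)
unpack(key::UInt64) = (UInt32(key >> 32), UInt32(key & 0xffffffff))

# A single atomic min over such keys would then pick the winning candidate;
# here the reduction is emulated sequentially for illustration.
candidates = [pack(UInt32(7), UInt32(2)),
              pack(UInt32(3), UInt32(9)),
              pack(UInt32(3), UInt32(4))]
best = minimum(candidates)
col, row = unpack(best)   # smallest column wins; ties are broken by row
```

The limitation mentioned in the comment is exactly that real tableaux need more than 64 bits of decision state, so a single packed word only covers restricted cases.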
Pondering this further, there is a mutex-free implementation using

@Krastanov, any thoughts on the above proposition of

I like your suggestion. Memory consumption is probably the least important constraint in how this library is usually used.
Great news!
Yeah :) If you have the time to make small examples of issues with very succinctly described rationale (it is really important to be succinct and trivially easy to understand in these), the JuliaGPU maintainers would certainly appreciate such issue reports (maybe more aspirational "programming language theory" feature requests). More practically, in the context of this library: do not hesitate to pay a 20% performance penalty for code that is half the length (and much simpler) compared to the more performant version, especially if this will save you time. Simplicity (with documented issues and comments explaining that there might be a faster method) is of great value.
Force-pushed from ce32a1c to e19e58a (…ntroduce benchmark and test scripts for canonicalization routines).
Good news: The benchmark suite is now much more robust and dynamic in how it tackles long-running tasks. There is now a trigger threshold after which it will extrapolate from previous data rather than actually conduct any further runs, as well as a simple toggle to just perform

Bad news: I just realised a major oversight in the implemented calls. Since my struggles with the JuliaGPU ecosystem made me forget that (Mixed)Destabilizers exist, the present routines will only modify the Stabilizer section but completely disregard updating the accompanying Destabilizer generators.

Good news: The solution is simply one divide-and-conquer xor reduction before each

Bad news: I will be travelling this week and may not find the time to implement said workaround.
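The divide-and-conquer xor reduction mentioned above can be sketched as follows. This is a hedged, sequential illustration of the access pattern only; a GPU kernel would distribute the strided passes across work-items with a synchronization barrier between passes, and the names here are hypothetical:

```julia
# Tree-shaped xor reduction: ceil(log2(n)) strided passes fold all rows into
# rows[1]. In a kernel each pass would be one synchronized step over the
# work-group, touching disjoint pairs of rows.
function xor_reduce!(rows::AbstractVector{UInt64})
    n = length(rows)
    stride = 1
    while stride < n
        for i in 1:(2stride):n
            j = i + stride
            j <= n && (rows[i] ⊻= rows[j])
        end
        stride <<= 1
    end
    return rows[1]
end

xor_reduce!(UInt64[0b0011, 0b0101, 0b1001, 0b0001])  # == 0b1110
```

Because each pass touches disjoint pairs, the passes are embarrassingly parallel internally and only need a barrier between them, which matches the work-efficiency goal stated earlier in the thread.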
Thanks for the update. This sounds great! The schedule you are suggesting is perfectly reasonable, no worries about that.
Apologies for the prolonged absence, a lot has been going on this past month between the nightmare of hunting for suitable accommodation and the Sisyphean climb against administrative bureaucracy. Hopefully the situation will clear up come early October so I can finally get some clearance to continue working on this draft.

No worries, @Hamiltonian-Action, your contributions are very valuable. Let me know if there is anything I can help with.
Well this is a bit embarrassing, I had finished up work on `canonicalize` and `canonicalize_rref` on Friday but must have forgotten to push things out and set up this draft.

There are lots of files that have been modified in this branch but most of them are merely structural and relate to code organisation; the actual meaningful changes are found in the following:

- `default_parameters.jl`
- `enumerations.jl` and `fixed_sizes.jl`
- `mutex_configuration.jl` and `mutex_management.jl`
- `bit_manipulation.jl`
- `scan_step.jl`
- `snippets.jl` (namely, the new snippets in the catalogue)
- `reductions.jl` (namely, `reduce_lexicographic_min!`)
- `canonicalize.jl` and `canonicalize_rref.jl`

I believe the implementation to be reasonably performant, work-efficient, and quite flexible for potential extensibility. However, it is also a sinful blight upon this cruel and uncaring universe due to necessitating coding around some terrible ergonomics and missing features. Test files, benchmarks, and an implementation of `canonicalize_gott` will follow in a suitable time frame.

Side note, the implementation provided can handle both left and right multiplication as well as X-first or Z-first ordering. Is there any consistent notion for what ought to be a `::T` argument and what ought to be a `::Val{T}` parameter in terms of the public-facing API? Internally they can be whatever the implementation wishes but I would prefer giving this some structured thought so as to avoid having different functions accept different signatures -- e.g. the `mul` family takes `phases::Val{B}` whereas the `canonicalize` functions only accept `phases::Bool`.