-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements #28136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
👋 Welcome back vpaprotski! A progress list of the required criteria for merging this PR into |
|
@vpaprotsk This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 1 new commit pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
|
@vpaprotsk The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
|
Nice speedup. This improvement seems worthy of a release note. |
|
/label add hotspot-compiler-dev |
|
@jatin-bhateja |
|
@ferakocz @ascarpino when you can spare some time, would appreciate a review (would like to get this into 26 if possible..) |
|
|
||
| void Assembler::vmovsldup(XMMRegister dst, XMMRegister src, int vector_len) { | ||
| assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : | ||
| (vector_len == AVX_256bit ? VM_Version::supports_avx2() : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector length 256 bit is supported by AVX=1.
|
|
||
| void Assembler::vmovshdup(XMMRegister dst, XMMRegister src, int vector_len) { | ||
| assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : | ||
| (vector_len == AVX_256bit ? VM_Version::supports_avx2() : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector length 256 bit is supported by AVX=1.
| // Element size (in bits) is specified by size parameter. | ||
| // size 0 and 1 are used for initial and final shuffles respectivelly of | ||
| // dilithiumAlmostInverseNtt and dilithiumAlmostNtt. | ||
| // NOTE: For size 0 and 1, input1[] and input2[] are modified in-place |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the size-in-bits when size is 0 and 1? What is the difference between size 0 and size1? The overloading of size makes it confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size 0 seems to be doing a different shuffle than what is described in the diagram.
| // 0b-1-2-3-1 | ||
| __ vshufps(output2[i], input1[i], input2[i], 0b11011101, vector_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean this to be //0b-1-3-1-3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or 3-1-3-1.
mcpowers
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to have @kuksenko or @ericcaspole look at MLDSABench.java.
|
|
||
| import java.lang.invoke.MethodHandle; | ||
| import java.lang.invoke.MethodHandles; | ||
| import java.lang.reflect.Field; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused import statement
| import java.lang.invoke.MethodHandles; | ||
| import java.lang.reflect.Field; | ||
| import java.lang.reflect.Method; | ||
| import java.lang.reflect.Constructor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused import
| long seed, int i) throws Exception, Throwable { | ||
| int[] coeffs3 = new int[ML_DSA_N]; | ||
| for (int j = 0; j<ML_DSA_N; j++) { | ||
| coeffs3[j] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coeffs3 is written to but never read
| int[] prod4 = new int[ML_DSA_N]; | ||
| try { | ||
| for (int i = 0; i < repeat; i++) { | ||
| // seed = rnd.nextLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 lines commented out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was useful during development and might be useful hint for debugging; instead of deleting, added a comment. Let me know if that works
| -554416, 3919660, -48306, -1362209, 3937738, 1400424, -846154, 1976782 | ||
| }; | ||
| } | ||
| // java --add-opens java.base/sun.security.provider=ALL-UNNAMED -XX:+UseDilithiumIntrinsics test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is line is useful. Not sure I would hide it at the bottom of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually meant to delete it, but will move it to the top.
| @@ -0,0 +1,421 @@ | |||
| /* | |||
| * Copyright (c) 2015, 2018, Oracle and/or its affiliates. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copyright date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was some copy-paste! Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor initial comments
| @@ -0,0 +1,517 @@ | |||
| /* | |||
| * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved. | |
| * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved. |
|
|
||
| for (int j = 0; j<ML_DSA_N; j++) { | ||
| coeffs1[j] = rnd.nextInt(); | ||
| coeffs2[j] = rnd.nextInt(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can uses generators for randome initialization of array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant this?
coeffs1 = rnd.ints(ML_DSA_N).toArray();
coeffs2 = rnd.ints(ML_DSA_N).toArray();
Didn't know about this, thanks. It does work..
But the original purpose (perhaps misguided, but its done) was to 'factor out' the allocations; the outer loop runs many million times (I've left it running for 6+hours during development) and so I wanted a 'somewhat efficient' test.
In hindsight, these (1k) arrays could probably be stack allocated, but I did not want to depend on an optimization when I could just write it without allocations in the mainline
| long seed = rnd.nextLong(); | ||
| rnd.setSeed(seed); | ||
| //Note: it might be useful to increase this number during development of new intrinsics | ||
| final int repeat = 10000000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of high repetition count can you try tuning the tiered compilation threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of the test is to test various (pseudo-random) values and compare the results to the java implementation of same code. A single run-though of the test doesn't always prove that there are no bugs.
A bit philosophical.. as is well known, when writing crypto, branches (conditional on secret) are disallowed; but e.g. carry propagation has the same 'conditional execution' effect. (Instead of "have you tested every branch direction" its "have you tested every carry") Besides a very careful range/overflow analysis (which I also did.. ntt functions skate very close to the int limit), exhaustive fuzz testing is the best method to find conditions that manual (range/overflow) analysis hasn't found; fuzz testing has very little math built in, so its also good at finding 'blind spots' I (and whomever has to review) might have not thought of..
| assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : | ||
| (vector_len == AVX_256bit ? VM_Version::supports_avx2() : | ||
| (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), ""); | ||
| InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)
But your assertions are nowhere checking this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe those asserts are in vex_prefix_and_encode (
jdk/src/hotspot/cpu/x86/assembler_x86.cpp
Line 13164 in 6d3f779
| assert(((!is_extended) || (!attributes->is_legacy_mode())),"XMM register should be 0-15"); |
vex_prefix (jdk/src/hotspot/cpu/x86/assembler_x86.cpp
Line 13047 in 6d3f779
| assert((!is_extended || (!attributes->is_legacy_mode())),"XMM register should be 0-15"); |
I also haven't found any other instruction that does this check so I could emulate the style.
| assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : | ||
| (vector_len == AVX_256bit ? VM_Version::supports_avx2() : | ||
| (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), ""); | ||
| InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)
But your assertions are nowhere checking this.
| } | ||
|
|
||
| void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) { | ||
| assert(VM_Version::supports_evex(), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert(VM_Version::supports_evex(), ""); | |
| assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took the patch, but also kept the supports_evex() assert
| } | ||
|
|
||
| void Assembler::evmovshdup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) { | ||
| assert(VM_Version::supports_evex(), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
| MacroAssembler *_masm) { | ||
|
|
||
| static address generate_dilithiumAlmostNtt_avx(StubGenerator *stubgen, | ||
| int vector_len, MacroAssembler *_masm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation corretness
|
What is the reason to add a new microbenchmark? org.openjdk.bench.javax.crypto.full.KeyPairGeneratorBench.MLDSA.generateKeyPair |
I can definitely remove it, got no strong attachment to it.. I did find it useful during development and thought it might be useful during review to verify performance.. but the usefulness of it beyond is indeed debatable. You might notice its a lot more 'granular'; it measures the performance of the intrinsics themselves, not the ("10-level deep") "wrappers". That said, those "wrappers" is what actual user will see and what we should be measuring. This new benchmark is only useful to another intrinsic developer.. (but it should already be usable by other platforms not just Intel?) |
I understand your reasons. The question is whether you'll need the microbenchmark in the future. If no (or probably no), please remove the micro. |
| // r1 in Quotient | ||
| // r1 = r1 & quotient; // copy 0 or keep as is, using EqMsk as filter | ||
| for (int i = 0; i < regCnt; i++) { | ||
| // FIXME: replace with void evmovdqul(Address dst, KRegister mask, XMMRegister src, bool merge, int vector_len);? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the fixme a leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Removed. (I think I was considering merging this instruction with the storeXmm, but there really isnt a good way to do that)
sviswa7
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@kuksenko I decided to just remove it. If anyone wants it back, its in my git history (I usually keep my branches after merge..) |
You could put a comment with the link into JBS issue to make it easier to discover later. (Or just attach the source file there.) |
|
Always faster and never slower: SignatureBench.MLDSA with There's nothing special about my benchmark. It's the one in OpenJDK (javax.crypto.full.SignatureBench). |
|
Good work! I just found a few typos in the comments. |
| // | B | | D | | ... | ||
| // +-----+-----+-----+-----+----- | ||
| // | ||
| // NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: respectivelly -> respectively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // the odd numbered slots of a third register. | ||
| // 2. Swap the even and odd numbered slots of the original input registers. | ||
| // 3. Similar to step 1, but into a different output register. | ||
| // the odd numbered slots of a scratch2 register. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: scratch2 -> scratch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I meant "the scratch2" register here.. reworded, please double check if its clearer..
| // 2. Swap the even and odd numbered slots of the original input registers. | ||
| // 3. Similar to step 1, but into a different output register. | ||
| // the odd numbered slots of a scratch2 register. | ||
| // 2. Swap the even and odd numbered slots of the original input registers.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: unnecessary '*' at the end
| // 3. Similar to step 1, but into a different output register. | ||
| // the odd numbered slots of a scratch2 register. | ||
| // 2. Swap the even and odd numbered slots of the original input registers.* | ||
| // 3. Similar to step 1, but into output register. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: into output register -> into an output register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used 'the' to be 'specific'.. (I think the lack of articles was causing the confusion.. "the scratch2 register is combined with the output register into scratch.. or something..) Also reworded step 4?
| // registers indexed by the numbers in inputRegs2 will contain the same number, | ||
| // this should be indicated by calling this function with | ||
| // input2NeedsShuffle=false . | ||
| // (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: unnecessary '(*' at the beginning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my attempt to add a note to second step.. spelled out "Note"? or can just remove, since swapping only happens on second step..
| // input in two stages. First half, load 8 registers 32 integers each apart. | ||
| // With one load, we can process level 0-2 (128-, 64- and 32-integers apart) | ||
| // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-, | ||
| // 2-, 1-integer appart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appart -> apart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks like I've always misspelled that word! :)
| // With one load, we can process level 0-2 (128-, 64- and 32-integers apart) | ||
| // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-, | ||
| // 2-, 1-integer appart) | ||
| // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appart -> apart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers | ||
| // Other levels, shuffles can be done by re-aranging register order | ||
|
|
||
| // Four batches of 8 registers each, 128 bytes appart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appart -> apart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| for (int i = 0; i < 8; i++) { | ||
| __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit); | ||
| // Four batches of 8 registers each, 128 bytes appart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appart -> apart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // the substartion is (Montgomery) multiplied by the corresponding zetas. | ||
| // In each level we just collect the coefficients (using evpermi2d() | ||
| // instructions where necessary, i.e. on levels 0-4) so that the results of | ||
| // the substration is (Montgomery) multiplied by the corresponding zetas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
substration -> subtraction (I know this was in my own comment :-( )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done (funny, thats exactly how I say "substraction" in my head too :D )
Need bit of clarification.. (I think you are saying there is a regression?).
In my measurements for AVX512 parts, I had seen between 0%->6% across That is also why I had included the other (deleted) microbenchmark.. Hence the claim 'never worse'.. A more precise claim..: |
|
The 2.24% improvement is the difference between My baseline is simply a build without your changes. I compared this with a build containing your changes and see a 2.24% improvement. Verification showed the least amount of improvement (same as what you observed). "never worse" is just my way of saying "always faster". |
vpaprotsk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // | B | | D | | ... | ||
| // +-----+-----+-----+-----+----- | ||
| // | ||
| // NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // the odd numbered slots of a third register. | ||
| // 2. Swap the even and odd numbered slots of the original input registers. | ||
| // 3. Similar to step 1, but into a different output register. | ||
| // the odd numbered slots of a scratch2 register. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I meant "the scratch2" register here.. reworded, please double check if its clearer..
| // registers indexed by the numbers in inputRegs2 will contain the same number, | ||
| // this should be indicated by calling this function with | ||
| // input2NeedsShuffle=false . | ||
| // (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my attempt to add a note to second step.. spelled out "Note"? or can just remove, since swapping only happens on second step..
| // 3. Similar to step 1, but into a different output register. | ||
| // the odd numbered slots of a scratch2 register. | ||
| // 2. Swap the even and odd numbered slots of the original input registers.* | ||
| // 3. Similar to step 1, but into output register. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used 'the' to be 'specific'.. (I think the lack of articles was causing the confusion.. "the scratch2 register is combined with the output register into scratch.. or something..) Also reworded step 4?
| // If so, use output: | ||
| const XMMRegister* scratch = scratch1 == input1 ? output: scratch1; | ||
|
|
||
| // scratch = input1_even*intput2_even |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // With one load, we can process level 0-2 (128-, 64- and 32-integers apart) | ||
| // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-, | ||
| // 2-, 1-integer appart) | ||
| // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-, | ||
| // 2-, 1-integer appart) | ||
| // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers | ||
| // Other levels, shuffles can be done by re-aranging register order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers | ||
| // Other levels, shuffles can be done by re-aranging register order | ||
|
|
||
| // Four batches of 8 registers each, 128 bytes appart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // the substartion is (Montgomery) multiplied by the corresponding zetas. | ||
| // In each level we just collect the coefficients (using evpermi2d() | ||
| // instructions where necessary, i.e. on levels 0-4) so that the results of | ||
| // the substration is (Montgomery) multiplied by the corresponding zetas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done (funny, thats exactly how I say "substraction" in my head too :D )
|
|
||
| for (int i = 0; i < 8; i++) { | ||
| __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit); | ||
| // Four batches of 8 registers each, 128 bytes appart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work @vpaprotsk ,
Please also add in comments the links to original reference implimentation of stub.
| Assembler::AVX_512bit, scratch); // 2^64 mod q | ||
| vector_len, scratch); // 2^64 mod q | ||
| if (vector_len == Assembler::AVX_512bit) { | ||
| __ mov64(scratch, 0b0101010101010101); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add long constant suffix
| } else { | ||
| __ evpbroadcastd(constant, rConstant, Assembler::AVX_512bit); // constant multiplier | ||
|
|
||
| __ mov64(scratch, 0b0101010101010101); //dw-mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant suffix
| } | ||
| } | ||
| static void loadXmms(const XMMRegister destinationRegs[], Register source, int offset, | ||
| int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) { | |
| int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) { |
| __ evmovdqul(Address(destination, offset + i * XMMBYTES), xmm(xmmRegs[i]), | ||
| Assembler::AVX_512bit); | ||
| static void storeXmms(Register destination, int offset, const XMMRegister xmmRegs[], | ||
| int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) { | |
| int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) { |
|
|
||
| // zetas (int[128*8]) = c_rarg1 | ||
| static address generate_dilithiumAlmostInverseNtt_avx(StubGenerator *stubgen, | ||
| int vector_len,MacroAssembler *_masm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix indentation
| static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen, | ||
| MacroAssembler *_masm) { | ||
| static address generate_dilithiumNttMult_avx(StubGenerator *stubgen, | ||
| int vector_len, MacroAssembler *_masm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix indentation
| const XMMRegister Coeffs4_2[] = {xmm5, xmm21, xmm7, xmm23}; | ||
|
|
||
| // Constants for shuffle and montMul64 | ||
| __ mov64(scratch, 0b1010101010101010); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64 bit constant suffix
| static address generate_dilithiumMontMulByConstant_avx512(StubGenerator *stubgen, | ||
| MacroAssembler *_masm) { | ||
| static address generate_dilithiumMontMulByConstant_avx(StubGenerator *stubgen, | ||
| int vector_len, MacroAssembler *_masm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix indentation
SignatureBench.MLDSAis 1.2x-2.2x fasterSignatureBench.MLDSAis upto 5% faster, never slowerNote on intrinsic:
Tests and benchmarks:
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28136/head:pull/28136$ git checkout pull/28136Update a local copy of the PR:
$ git checkout pull/28136$ git pull https://git.openjdk.org/jdk.git pull/28136/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28136View PR using the GUI difftool:
$ git pr show -t 28136Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28136.diff
Using Webrev
Link to Webrev Comment