8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements #28136

vpaprotsk · 2025-11-04T16:38:49Z

New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline
- SignatureBench.MLDSA is 1.2x-2.2x faster
- Note: there are no AVX2 SHA3 intrinsics yet (under review: "AVX2 and AVX512 intrinsics for SHA3", vpaprotsk/jdk#7)
AVX512 intrinsic improvements are 1.24x-1.5x faster than the current version
- SignatureBench.MLDSA is up to 5% faster, never slower

Note on intrinsic:

The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
Code was refactored to allow reuse of the same assembler (where possible) for AVX512 and AVX2

Tests and benchmarks:

Added a fuzz test to ensure the Java code and the intrinsics produce exactly the same result
Added a benchmark to measure the performance of the intrinsics themselves

make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements (Enhancement - P4)

Reviewers

Mark Powers (@mcpowers - Committer) Review applies to 6d3f7794
Sandhya Viswanathan (@sviswa7 - Reviewer)
Anthony Scarpino (@ascarpino - Reviewer) Review applies to b04f4f0d

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28136/head:pull/28136
$ git checkout pull/28136

Update a local copy of the PR:
$ git checkout pull/28136
$ git pull https://git.openjdk.org/jdk.git pull/28136/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28136

View PR using the GUI difftool:
$ git pr show -t 28136

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28136.diff

bridgekeeper · 2025-11-04T16:39:56Z

👋 Welcome back vpaprotski! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-11-04T16:41:37Z

@vpaprotsk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements

Reviewed-by: sviswanathan, mpowers, ascarpino

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the master branch:

507a6d3: 8368001: java/text/Format/NumberFormat/NumberRoundTrip.java timed out

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-11-04T16:42:56Z

@vpaprotsk The following labels will be automatically applied to this pull request:

hotspot
security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-11-04T16:46:35Z

Webrevs

seanjmullan · 2025-11-05T17:47:35Z

Nice speedup. This improvement seems worthy of a release note.

jatin-bhateja · 2025-11-07T07:21:31Z

/label add hotspot-compiler-dev

openjdk · 2025-11-07T07:22:31Z

@jatin-bhateja
The hotspot-compiler label was successfully added.

vpaprotsk · 2025-11-13T19:37:04Z

@ferakocz @ascarpino when you can spare some time, would appreciate a review (would like to get this into 26 if possible..)

sviswa7 · 2025-11-14T17:07:53Z

src/hotspot/cpu/x86/assembler_x86.cpp


+void Assembler::vmovsldup(XMMRegister dst, XMMRegister src, int vector_len) {
+    assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
+            (vector_len == AVX_256bit ? VM_Version::supports_avx2() :


The 256-bit vector length is supported with AVX=1 (AVX2 is not needed).

sviswa7 · 2025-11-14T17:11:28Z

src/hotspot/cpu/x86/assembler_x86.cpp

+
+void Assembler::vmovshdup(XMMRegister dst, XMMRegister src, int vector_len) {
+  assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
+          (vector_len == AVX_256bit ? VM_Version::supports_avx2() :


The 256-bit vector length is supported with AVX=1 (AVX2 is not needed).

sviswa7 · 2025-11-14T17:51:24Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+// Element size (in bits) is specified by size parameter.
+// size 0 and 1 are used for initial and final shuffles respectivelly of
+// dilithiumAlmostInverseNtt and dilithiumAlmostNtt.
+// NOTE: For size 0 and 1, input1[] and input2[] are modified in-place


What is the size in bits when size is 0 or 1? What is the difference between size 0 and size 1? The overloading of size makes this confusing.

Size 0 seems to be doing a different shuffle than the one described in the diagram.

sviswa7 · 2025-11-14T19:03:36Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+            // 0b-1-2-3-1
+            __ vshufps(output2[i], input1[i], input2[i], 0b11011101, vector_len);


Did you mean this to be //0b-1-3-1-3?

or 3-1-3-1.
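
For readers following the immediate discussion above, here is a minimal Java sketch of how an 8-bit shuffle immediate splits into four 2-bit selectors. It models a single 128-bit lane of a shufps-style shuffle (low two result elements taken from the first source, high two from the second), which is my reading of the instruction semantics rather than code from this PR; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the PR: decodes a shufps-style immediate.
final class ShufpsImm {
    // Split imm8 into four 2-bit selectors; index 0 holds bits [1:0],
    // i.e. the selector for the lowest result element.
    static int[] selectors(int imm8) {
        int[] sel = new int[4];
        for (int i = 0; i < 4; i++) {
            sel[i] = (imm8 >>> (2 * i)) & 0b11;
        }
        return sel;
    }

    // Model one 128-bit lane: result elements 0 and 1 are picked from src1,
    // result elements 2 and 3 are picked from src2.
    static int[] shuffleLane(int[] src1, int[] src2, int imm8) {
        int[] sel = selectors(imm8);
        return new int[] { src1[sel[0]], src1[sel[1]], src2[sel[2]], src2[sel[3]] };
    }

    public static void main(String[] args) {
        int imm = 0b11011101;
        // Selectors listed from the lowest element upward: [1, 3, 1, 3]
        System.out.println(Arrays.toString(selectors(imm)));
        // Picks the odd-indexed elements of both inputs: [11, 13, 21, 23]
        System.out.println(Arrays.toString(
            shuffleLane(new int[] {10, 11, 12, 13}, new int[] {20, 21, 22, 23}, imm)));
    }
}
```

Read from the most significant bits down, the same immediate spells 3-1-3-1; read from the lowest element up, it spells 1-3-1-3. Both forms in the review comment describe the same shuffle, and neither matches the original "0b-1-2-3-1" annotation.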

mcpowers

You might want to have @kuksenko or @ericcaspole look at MLDSABench.java.

mcpowers · 2025-11-16T16:38:02Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+
+import java.lang.invoke.MethodHandle;
+import java.lang.invoke.MethodHandles;
+import java.lang.reflect.Field;


unused import statement

mcpowers · 2025-11-16T16:39:12Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+import java.lang.invoke.MethodHandles;
+import java.lang.reflect.Field;
+import java.lang.reflect.Method;
+import java.lang.reflect.Constructor;


unused import

mcpowers · 2025-11-16T16:45:50Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+        long seed, int i) throws Exception, Throwable {
+        int[] coeffs3 = new int[ML_DSA_N];
+        for (int j = 0; j<ML_DSA_N; j++) {
+            coeffs3[j] =


coeffs3 is written to but never read

mcpowers · 2025-11-16T16:47:29Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+        int[] prod4 = new int[ML_DSA_N];
+        try {
+            for (int i = 0; i < repeat; i++) {
+                // seed = rnd.nextLong();


2 lines commented out

This was useful during development and might be a useful hint for debugging; instead of deleting, I added a comment. Let me know if that works.

mcpowers · 2025-11-16T16:52:07Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+            -554416, 3919660, -48306, -1362209, 3937738, 1400424, -846154, 1976782
+    };
+}
+// java --add-opens java.base/sun.security.provider=ALL-UNNAMED  -XX:+UseDilithiumIntrinsics test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java


This line is useful. Not sure I would hide it at the bottom of the file.

I actually meant to delete it, but will move it to the top.

mcpowers · 2025-11-16T16:53:19Z

test/micro/org/openjdk/bench/javax/crypto/full/MLDSABench.java

@@ -0,0 +1,421 @@
+/*
+ * Copyright (c) 2015, 2018, Oracle and/or its affiliates. All rights reserved.


Copyright date.

That was some copy-paste! Thanks

jatin-bhateja

Minor initial comments

jatin-bhateja · 2025-11-17T06:31:59Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

@@ -0,0 +1,517 @@
+/*
+ * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.


Suggested change

* Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.

* Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.

jatin-bhateja · 2025-11-17T06:34:01Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+
+        for (int j = 0; j<ML_DSA_N; j++) {
+            coeffs1[j] = rnd.nextInt();
+            coeffs2[j] = rnd.nextInt();
+        }


You can use generators for random initialization of the arrays.

I think you meant this?

coeffs1 = rnd.ints(ML_DSA_N).toArray(); coeffs2 = rnd.ints(ML_DSA_N).toArray();

Didn't know about this, thanks. It does work..

But the original purpose (perhaps misguided, but it's done) was to 'factor out' the allocations; the outer loop runs many millions of times (I've left it running for 6+ hours during development) and so I wanted a 'somewhat efficient' test.

In hindsight, these (1k) arrays could probably be stack allocated, but I did not want to depend on an optimization when I could just write it without allocations in the mainline
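
To make the trade-off in this exchange concrete, here is a small Java sketch contrasting the two initialization styles: the stream-based form allocates a new array per call, while the loop refills an array owned by the caller. The class name is hypothetical and `ML_DSA_N = 256` is an assumption that matches the "1k arrays" remark (256 ints of 4 bytes each).

```java
import java.util.Random;

// Hypothetical illustration, not the test code from the PR.
final class FillStyles {
    static final int ML_DSA_N = 256; // assumed polynomial length

    // Allocating style: every call builds a fresh array from the stream.
    static int[] fresh(Random rnd) {
        return rnd.ints(ML_DSA_N).toArray();
    }

    // In-place style: the caller allocates once and reuses the array
    // across many fuzzing rounds.
    static void refill(Random rnd, int[] coeffs) {
        for (int j = 0; j < coeffs.length; j++) {
            coeffs[j] = rnd.nextInt();
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L);
        int[] reused = new int[ML_DSA_N];
        for (int round = 0; round < 3; round++) {
            refill(rnd, reused);          // no allocation inside the loop
        }
        int[] allocated = fresh(rnd);     // one new array per call
        System.out.println(reused.length + " " + allocated.length);
    }
}
```

Either form is fine for a short test; the in-place loop only matters when the outer loop runs millions of times, as described above.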

jatin-bhateja · 2025-11-17T06:35:52Z

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java

+        long seed = rnd.nextLong();
+        rnd.setSeed(seed);
+        //Note: it might be useful to increase this number during development of new intrinsics
+        final int repeat = 10000000;


Instead of a high repetition count, can you try tuning the tiered compilation threshold?

The purpose of the test is to try various (pseudo-random) values and compare the results to the Java implementation of the same code. A single run-through of the test doesn't always prove that there are no bugs.

A bit philosophical.. as is well known, when writing crypto, branches (conditional on a secret) are disallowed; but e.g. carry propagation has the same 'conditional execution' effect. (Instead of "have you tested every branch direction" it's "have you tested every carry".) Besides a very careful range/overflow analysis (which I also did.. the ntt functions skate very close to the int limit), exhaustive fuzz testing is the best method to find conditions that manual (range/overflow) analysis hasn't found; fuzz testing has very little math built in, so it's also good at finding 'blind spots' I (and whoever has to review) might not have thought of..
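
The "have you tested every carry" point can be sketched as a small differential fuzz harness in Java. This is not the PR's test: the two midpoint functions are hypothetical stand-ins for "Java reference" and "intrinsic", and the buggy one is deliberately wrong only when a 32-bit sum carries out of `int`. The reseeding pattern (draw a seed, then `setSeed` with it) mirrors the hunk quoted above, so a failing round can be reproduced from its seed.

```java
import java.util.OptionalLong;
import java.util.Random;
import java.util.function.IntBinaryOperator;

// Hypothetical differential fuzz harness, for illustration only.
final class DiffFuzz {
    // Reference: the sum is formed in 64 bits, so it cannot overflow.
    static int midRef(int a, int b) {
        return (int) (((long) a + (long) b) >> 1);
    }

    // Candidate with a deliberate bug: the 32-bit sum wraps on overflow.
    static int midBuggy(int a, int b) {
        return (a + b) >> 1;
    }

    // Runs `rounds` rounds of n random pairs; returns the seed of the first
    // round in which the two implementations disagree, if any.
    static OptionalLong firstFailingSeed(IntBinaryOperator ref, IntBinaryOperator cand,
                                         int rounds, int n, long masterSeed) {
        Random rnd = new Random(masterSeed);
        int[] a = new int[n];
        int[] b = new int[n];
        for (int r = 0; r < rounds; r++) {
            long seed = rnd.nextLong();
            rnd.setSeed(seed); // the round is now reproducible from `seed`
            for (int j = 0; j < n; j++) {
                a[j] = rnd.nextInt();
                b[j] = rnd.nextInt();
            }
            for (int j = 0; j < n; j++) {
                if (ref.applyAsInt(a[j], b[j]) != cand.applyAsInt(a[j], b[j])) {
                    return OptionalLong.of(seed);
                }
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstFailingSeed(DiffFuzz::midRef, DiffFuzz::midRef, 100, 256, 1L));
        System.out.println(firstFailingSeed(DiffFuzz::midRef, DiffFuzz::midBuggy, 100, 256, 1L));
    }
}
```

A hand-picked test with small inputs would pass both versions; only inputs that trigger the carry expose the difference, which is the argument for running many random rounds.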

jatin-bhateja · 2025-11-17T06:44:39Z

src/hotspot/cpu/x86/assembler_x86.cpp

+    assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
+            (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
+            (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
+  InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);


When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)

But your assertions are nowhere checking this.

I believe those asserts are in vex_prefix_and_encode (

jdk/src/hotspot/cpu/x86/assembler_x86.cpp

Line 13164 in 6d3f779

assert(((!is_extended) || (!attributes->is_legacy_mode())),"XMM register should be 0-15");

) and vex_prefix (

jdk/src/hotspot/cpu/x86/assembler_x86.cpp

Line 13047 in 6d3f779

assert((!is_extended || (!attributes->is_legacy_mode())),"XMM register should be 0-15");

)

I also haven't found any other instruction that does this check so I could emulate the style.

jatin-bhateja · 2025-11-17T06:44:55Z

src/hotspot/cpu/x86/assembler_x86.cpp

+  assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
+          (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
+          (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
+  InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);


When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)

But your assertions are nowhere checking this.

jatin-bhateja · 2025-11-17T06:47:34Z

src/hotspot/cpu/x86/assembler_x86.cpp

+}
+
+void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
+  assert(VM_Version::supports_evex(), "");


Suggested change

assert(VM_Version::supports_evex(), "");

assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");

Took the patch, but also kept the supports_evex() assert

jatin-bhateja · 2025-11-17T06:48:20Z

src/hotspot/cpu/x86/assembler_x86.cpp

+}
+
+void Assembler::evmovshdup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
+  assert(VM_Version::supports_evex(), "");


Same as above

jatin-bhateja · 2025-11-17T06:53:01Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-                                                  MacroAssembler *_masm) {
-
+static address generate_dilithiumAlmostNtt_avx(StubGenerator *stubgen,
+                            int vector_len, MacroAssembler *_masm) {


Indentation correctness

kuksenko · 2025-11-19T21:51:32Z

What is the reason to add a new microbenchmark?
We already have enough micros covering MLDSA:

org.openjdk.bench.javax.crypto.full.KeyPairGeneratorBench.MLDSA.generateKeyPair
org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.sign
org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.verify
org.openjdk.bench.javax.crypto.small.KeyPairGeneratorBench.MLDSA.generateKeyPair
org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.sign
org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.verify

vpaprotsk · 2025-11-19T21:59:03Z

I can definitely remove it, I have no strong attachment to it.. I did find it useful during development and thought it might be useful during review to verify performance.. but its usefulness beyond that is indeed debatable.

You might notice it's a lot more 'granular'; it measures the performance of the intrinsics themselves, not the ("10-level deep") "wrappers". That said, those "wrappers" are what an actual user will see and what we should be measuring.

This new benchmark is only useful to another intrinsic developer.. (but it should already be usable by other platforms not just Intel?)

kuksenko · 2025-11-19T22:40:41Z

I understand your reasons. The question is whether you'll need the microbenchmark in the future. If no (or probably no), please remove the micro.
If needed, please move it from the "org.openjdk.bench.javax.crypto.full" package to "org.openjdk.bench.javax.crypto". The "small" and "full" packages are supposed to contain only public-API micros.

sviswa7 · 2025-11-20T21:31:31Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+    // r1 in Quotient
+    // r1 = r1 & quotient; // copy 0 or keep as is, using EqMsk as filter
+    for (int i = 0; i < regCnt; i++) {
+      // FIXME: replace with void evmovdqul(Address dst, KRegister mask, XMMRegister src, bool merge, int vector_len);?


Is the fixme a leftover?

Yes. Removed. (I think I was considering merging this instruction with the storeXmm, but there really isn't a good way to do that)

sviswa7

Looks good to me.

vpaprotsk · 2025-11-20T23:13:41Z

@kuksenko I decided to just remove it. If anyone wants it back, it's in my git history (I usually keep my branches after merge..)

iwanowww · 2025-11-20T23:39:05Z

If anyone wants it back, it's in my git history (I usually keep my branches after merge..)

You could put a comment with the link into JBS issue to make it easier to discover later. (Or just attach the source file there.)

vpaprotsk · 2025-11-21T17:15:04Z

@iwanowww thanks for the suggestion! attached to JBS.

@mcpowers would you mind running your internal test suite for this PR? I am thinking of integrating early next week, if there are no objections; we are getting close to the release deadline and I don't want to cut it even closer..

mcpowers · 2025-11-24T16:28:44Z

Always faster and never slower:

SignatureBench.MLDSA with +UseDilithiumIntrinsics shows an average 1.61% improvement across all algorithms and data sizes.
Measuring SignatureBench.MLDSA against a baseline build without the fix, shows an average 2.24% improvement across all algorithms and data sizes.

There's nothing special about my benchmark. It's the one in OpenJDK (javax.crypto.full.SignatureBench).

Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

ferakocz · 2025-11-24T16:39:00Z

Good work! I just found a few typos in the comments.

ferakocz · 2025-11-24T15:35:12Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+// |  B  |     |  D  |     | ...
+// +-----+-----+-----+-----+-----
+//
+// NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of


Typo: respectivelly -> respectively

ferakocz · 2025-11-24T15:50:30Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-//    the odd numbered slots of a third register.
-// 2. Swap the even and odd numbered slots of the original input registers.
-// 3. Similar to step 1, but into a different output register.
+//    the odd numbered slots of a scratch2 register.


Typo: scratch2 -> scratch

I think I meant "the scratch2" register here.. reworded, please double check if it's clearer..

ferakocz · 2025-11-24T15:53:36Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-// 2. Swap the even and odd numbered slots of the original input registers.
-// 3. Similar to step 1, but into a different output register.
+//    the odd numbered slots of a scratch2 register.
+// 2. Swap the even and odd numbered slots of the original input registers.*


Typo: unnecessary '*' at the end

ferakocz · 2025-11-24T15:54:42Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-// 3. Similar to step 1, but into a different output register.
+//    the odd numbered slots of a scratch2 register.
+// 2. Swap the even and odd numbered slots of the original input registers.*
+// 3. Similar to step 1, but into output register.


Typo: into output register -> into an output register

Used 'the' to be specific.. (I think the lack of articles was causing the confusion: "the scratch2 register is combined with the output register into scratch", or something..) Also reworded step 4?

ferakocz · 2025-11-24T15:56:29Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-// registers indexed by the numbers in inputRegs2 will contain the same number,
-// this should be indicated by calling this function with
-// input2NeedsShuffle=false .
+// (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap


Typo: unnecessary '(*' at the beginning

This was my attempt to add a note to the second step.. spelled it out as "Note"? Or I can just remove it, since swapping only happens in the second step..

ferakocz · 2025-11-24T16:25:59Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+    // input in two stages. First half, load 8 registers 32 integers each apart.
+    // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
+    // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
+    // 2-, 1-integer appart)


appart -> apart

Thanks! Looks like I've always misspelled that word! :)
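
The level-to-distance mapping in the quoted comment (levels 0-2 are 128, 64 and 32 integers apart; levels 5, 6, 7 are 4, 2, 1 apart) can be sketched in Java as below. The polynomial length of 256 and the class and method names are assumptions for illustration. The lane-count predicate is my own way of expressing "the butterfly partner sits inside the same register", and with 8 lanes (a 256-bit register of 32-bit ints) it reproduces the "levels 5, 6, 7" statement.

```java
// Hypothetical illustration of the NTT level distances named in the comment.
final class NttLevels {
    static final int N = 256; // assumed number of coefficients

    // Distance (in ints) between the two butterfly inputs at a given level.
    static int stride(int level) {
        return (N / 2) >> level;
    }

    // True when both butterfly inputs fall inside one register of `lanes`
    // 32-bit elements, so pairing them needs an in-register shuffle.
    static boolean shuffleWithinRegister(int level, int lanes) {
        return stride(level) < lanes;
    }

    public static void main(String[] args) {
        for (int level = 0; level < 8; level++) {
            System.out.println("level " + level + ": " + stride(level)
                + " apart, in-register shuffle (8 lanes): "
                + shuffleWithinRegister(level, 8));
        }
    }
}
```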

ferakocz · 2025-11-24T16:26:37Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+    // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
+    // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
+    // 2-, 1-integer appart)
+    // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers


appart -> apart

ferakocz · 2025-11-24T16:27:17Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+    // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
+    // Other levels, shuffles can be done by re-aranging register order
+
+    // Four batches of 8 registers each, 128 bytes appart


appart -> apart

ferakocz · 2025-11-24T16:28:21Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp


-  for (int i = 0; i < 8; i++) {
-    __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit);
+    // Four batches of 8 registers each, 128 bytes appart


appart -> apart

ferakocz · 2025-11-24T16:32:12Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-  // the substartion is (Montgomery) multiplied by the corresponding zetas.
-  // In each level we just collect the coefficients (using evpermi2d()
-  // instructions where necessary, i.e. on levels 0-4) so that the results of
+  // the substration is (Montgomery) multiplied by the corresponding zetas.


substration -> subtraction (I know this was in my own comment :-( )

done (funny, that's exactly how I say "substraction" in my head too :D )
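
Since the comment under review refers to a Montgomery multiplication by the zetas, here is a minimal scalar Java sketch of that operation for the ML-DSA modulus q = 8380417 with R = 2^32. It is a generic textbook formulation under those assumptions, not the JDK's code or the stub's instruction sequence; the class and method names are hypothetical. The result t satisfies t * 2^32 ≡ b * c (mod q), which `holds` checks directly.

```java
import java.util.Random;

// Hypothetical scalar sketch of Montgomery multiplication modulo q.
final class MontMulSketch {
    static final int Q = 8380417;       // ML-DSA modulus: 2^23 - 2^13 + 1
    static final int Q_INV = 58728449;  // q^-1 mod 2^32, so Q * Q_INV wraps to 1

    // Returns t with t * 2^32 congruent to b * c modulo Q.
    static int montMul(int b, int c) {
        long a = (long) b * (long) c;
        int aLow = (int) a;              // low 32 bits of the product
        int aHigh = (int) (a >> 32);     // high 32 bits (signed)
        int m = Q_INV * aLow;            // wraps mod 2^32 by design
        // m * Q has the same low 32 bits as a, so the low halves cancel
        // and only the high halves need to be subtracted.
        return aHigh - (int) (((long) m * Q) >> 32);
    }

    // Checks the defining congruence for one input pair.
    static boolean holds(int b, int c) {
        long lhs = Math.floorMod(((long) montMul(b, c)) << 32, (long) Q);
        long rhs = Math.floorMod((long) b * (long) c, (long) Q);
        return lhs == rhs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7L);
        boolean ok = true;
        for (int i = 0; i < 1_000_000; i++) {
            ok &= holds(rnd.nextInt(), rnd.nextInt());
        }
        System.out.println(ok);
    }
}
```

The result is not reduced to a canonical range; as in the thread's remark about skating close to the int limit, callers have to track how far intermediate values can grow.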

vpaprotsk · 2025-11-24T17:15:46Z

Need a bit of clarification.. (I think you are saying there is a regression?).

+UseDilithiumIntrinsics should be redundant (i.e. vm_version_x86.cpp should automatically detect and turn the feature on).
- So if I read correctly.. the baseline measured already has the original intrinsics (implicitly) enabled..
  - therefore there is a 2.24% noise in the benchmark?

In my measurements for AVX512 parts, I had seen between 0%->6% across SignatureBench.MLDSA
- (some variation on desktop-vs-server parts..)
- SignatureBench.MLDSA.verify was worse, only 0->2% depending on keysize (iirc, bigger portion of benchmark was in SHA3 instead)
- SignatureBench.MLDSA.sign was better, 4-6% (also depending on datasize)

That is also why I had included the other (deleted) microbenchmark.. SignatureBench.MLDSA has a lot of 'other things' (e.g. SHA3) also happening, so the AVX512 intrinsic changes were harder to differentiate from noise..
- I had measured ~25%-50% improvement on purely the 5 intrinsics changed..

Hence the claim 'never worse'.. A more precise claim..:
- "New intrinsics seem to be better, but (at least for AVX512) existing intrinsics were already plenty good for MLDSA"

mcpowers · 2025-11-24T17:54:31Z

The 2.24% improvement is the difference between +UseDilithiumIntrinsics and -UseDilithiumIntrinsics. I just repeated the testing that you documented in the description section of this PR on a different machine.

My baseline is simply a build without your changes. I compared this with a build containing your changes and see a 2.24% improvement.

Verification showed the least amount of improvement (same as what you observed).

"never worse" is just my way of saying "always faster".

vpaprotsk

@mcpowers Thanks for tests!

@ferakocz thanks for the review! I think I took them all in, except for the montMul comment section.. Not quite what I meant so tried to reword.. see if it helps any?

vpaprotsk · 2025-11-24T20:10:40Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+    // If so, use output:
+    const XMMRegister* scratch = scratch1 == input1 ? output: scratch1;
+
+    // scratch = input1_even*intput2_even


jatin-bhateja

Very nice work @vpaprotsk,
Please also add links to the original reference implementation of the stub in the comments.

jatin-bhateja · 2025-11-25T02:44:35Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-                  Assembler::AVX_512bit, scratch); // 2^64 mod q
+                  vector_len, scratch); // 2^64 mod q
+  if (vector_len == Assembler::AVX_512bit) {
+    __ mov64(scratch, 0b0101010101010101);


Add long constant suffix

jatin-bhateja · 2025-11-25T02:44:59Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+  } else {
+    __ evpbroadcastd(constant, rConstant, Assembler::AVX_512bit); // constant multiplier
+
+    __ mov64(scratch, 0b0101010101010101); //dw-mask


Constant suffix

jatin-bhateja · 2025-11-25T02:50:41Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-    }
-}
+static void loadXmms(const XMMRegister destinationRegs[], Register source, int offset,
+                       int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {


Suggested change

int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {

int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {

jatin-bhateja · 2025-11-25T02:51:02Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-    __ evmovdqul(Address(destination, offset + i * XMMBYTES), xmm(xmmRegs[i]),
-                 Assembler::AVX_512bit);
+static void storeXmms(Register destination, int offset, const XMMRegister xmmRegs[],
+                       int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {


Suggested change

int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {

int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {

jatin-bhateja · 2025-11-25T02:53:56Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-
+// zetas (int[128*8]) = c_rarg1
+static address generate_dilithiumAlmostInverseNtt_avx(StubGenerator *stubgen,
+                                         int vector_len,MacroAssembler *_masm) {


Fix indentation

jatin-bhateja · 2025-11-25T02:55:59Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen,
-                                                MacroAssembler *_masm) {
+static address generate_dilithiumNttMult_avx(StubGenerator *stubgen,
+                                     int vector_len, MacroAssembler *_masm) {


Fix indentation

jatin-bhateja · 2025-11-25T02:56:20Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

+    const XMMRegister Coeffs4_2[] = {xmm5, xmm21, xmm7, xmm23};
+
+    // Constants for shuffle and montMul64
+    __ mov64(scratch, 0b1010101010101010);


64-bit constant suffix

jatin-bhateja · 2025-11-25T02:56:50Z

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp

-static address generate_dilithiumMontMulByConstant_avx512(StubGenerator *stubgen,
-                                                          MacroAssembler *_masm) {
+static address generate_dilithiumMontMulByConstant_avx(StubGenerator *stubgen,
+                                        int vector_len, MacroAssembler *_masm) {


Fix indentation

vpaprotsk added 4 commits July 9, 2025 02:58

AVX2 and AVX512 intrinsics for MLDSA

35841c7

Fixes and comments from Anas

2ff3b82

add copyright, whitespace and test jtreg tags

f4f84b6

Merge remote-tracking branch 'origin/master' into avx2-ntt

6d3f779

openjdk bot added the security (security-dev@openjdk.org) and hotspot (hotspot-dev@openjdk.org) labels Nov 4, 2025

openjdk bot added the rfr Pull request is ready for review label Nov 4, 2025

openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 7, 2025

sviswa7 reviewed Nov 14, 2025

mcpowers approved these changes Nov 16, 2025

jatin-bhateja reviewed Nov 17, 2025

openjdk bot added rfr Pull request is ready for review and removed rfr Pull request is ready for review labels Nov 17, 2025

vpaprotsk added 2 commits November 17, 2025 23:33

address first comments

050312f

whitespace

e913340

sviswa7 reviewed Nov 20, 2025

next set of comments

b04f4f0

sviswa7 approved these changes Nov 20, 2025

openjdk bot added the ready Pull request is ready to be integrated label Nov 20, 2025

ferakocz reviewed Nov 24, 2025

vpaprotsk commented Nov 24, 2025

Merge remote-tracking branch 'origin/master' into avx2-ntt

cefa021

ascarpino approved these changes Nov 24, 2025

openjdk bot removed the ready Pull request is ready to be integrated label Nov 24, 2025

comments from Ferenc

691e1df

sviswa7 approved these changes Nov 24, 2025

openjdk bot added the ready Pull request is ready to be integrated label Nov 24, 2025

spelling

bfc16f1

jatin-bhateja reviewed Nov 25, 2025

		// 0b-1-2-3-1
		__ vshufps(output2[i], input1[i], input2[i], 0b11011101, vector_len);
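
To make the immediate in the `vshufps` call concrete, here is a scalar sketch of the SHUFPS selection rule for a single 128-bit lane (the instruction repeats the same selection in every lane of a wider vector). `shufps_lane` is a hypothetical helper written for illustration, and it uses 32-bit integers rather than floats because only the element positions matter here. With `0b11011101` the result is the odd-indexed elements of the first source followed by the odd-indexed elements of the second.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lane = std::array<std::int32_t, 4>;  // four dwords of one 128-bit lane

// Scalar model of SHUFPS within one 128-bit lane: the two low result elements
// are picked from a, the two high ones from b, each by a 2-bit index in imm8.
inline Lane shufps_lane(const Lane& a, const Lane& b, std::uint8_t imm8) {
  return Lane{a[imm8 & 3], a[(imm8 >> 2) & 3],
              b[(imm8 >> 4) & 3], b[(imm8 >> 6) & 3]};
}
```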

		@@ -0,0 +1,421 @@
		/*
		* Copyright (c) 2015, 2018, Oracle and/or its affiliates. All rights reserved.

		@@ -0,0 +1,517 @@
		/*
		* Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.

	* Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.
	* Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.

	assert(VM_Version::supports_evex(), "");
	assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");


vpaprotsk commented Nov 4, 2025 • edited by openjdk bot

bridgekeeper bot commented Nov 4, 2025

openjdk bot commented Nov 4, 2025 • edited

openjdk bot commented Nov 4, 2025 • edited

mlbridge bot commented Nov 4, 2025 • edited

seanjmullan commented Nov 5, 2025

jatin-bhateja commented Nov 7, 2025

openjdk bot commented Nov 7, 2025

vpaprotsk commented Nov 13, 2025

mcpowers left a comment

jatin-bhateja left a comment • edited

kuksenko commented Nov 19, 2025 • edited

kuksenko commented Nov 19, 2025 • edited

iwanowww commented Nov 20, 2025 • edited