Skip to content

Conversation

@vpaprotsk
Copy link
Contributor

@vpaprotsk vpaprotsk commented Nov 4, 2025

  • New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline
  • AVX512 intrinsic improvements are 1.24x-1.5x faster then current version
    • SignatureBench.MLDSA is upto 5% faster, never slower

Note on intrinsic:

  • The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
  • Code was refactored to allow reuse of same assembler (as possible) for AVX512 and AVX2

Tests and benchmarks:

  • Added a fuzz test to ensure Java and intrinsic produces exactly same result
  • Added benchmark to measure the performance of intrinsic itself
make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28136/head:pull/28136
$ git checkout pull/28136

Update a local copy of the PR:
$ git checkout pull/28136
$ git pull https://git.openjdk.org/jdk.git pull/28136/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28136

View PR using the GUI difftool:
$ git pr show -t 28136

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28136.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 4, 2025

👋 Welcome back vpaprotski! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Nov 4, 2025

@vpaprotsk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements

Reviewed-by: sviswanathan, mpowers, ascarpino

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the master branch:

  • 507a6d3: 8368001: java/text/Format/NumberFormat/NumberRoundTrip.java timed out

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added security security-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Nov 4, 2025
@openjdk
Copy link

openjdk bot commented Nov 4, 2025

@vpaprotsk The following labels will be automatically applied to this pull request:

  • hotspot
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 4, 2025
@mlbridge
Copy link

mlbridge bot commented Nov 4, 2025

Webrevs

@seanjmullan
Copy link
Member

Nice speedup. This improvement seems worthy of a release note.

@jatin-bhateja
Copy link
Member

/label add hotspot-compiler-dev

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 7, 2025
@openjdk
Copy link

openjdk bot commented Nov 7, 2025

@jatin-bhateja
The hotspot-compiler label was successfully added.

@vpaprotsk
Copy link
Contributor Author

@ferakocz @ascarpino when you can spare some time, would appreciate a review (would like to get this into 26 if possible..)


void Assembler::vmovsldup(XMMRegister dst, XMMRegister src, int vector_len) {
assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
(vector_len == AVX_256bit ? VM_Version::supports_avx2() :
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector length 256 bit is supported by AVX=1.


void Assembler::vmovshdup(XMMRegister dst, XMMRegister src, int vector_len) {
assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
(vector_len == AVX_256bit ? VM_Version::supports_avx2() :
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector length 256 bit is supported by AVX=1.

// Element size (in bits) is specified by size parameter.
// size 0 and 1 are used for initial and final shuffles respectivelly of
// dilithiumAlmostInverseNtt and dilithiumAlmostNtt.
// NOTE: For size 0 and 1, input1[] and input2[] are modified in-place
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the size-in-bits when size is 0 and 1? What is the difference between size 0 and size1? The overloading of size makes it confusing.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size 0 seems to be doing a different shuffle than what is described in the diagram.

Comment on lines 136 to 137
// 0b-1-2-3-1
__ vshufps(output2[i], input1[i], input2[i], 0b11011101, vector_len);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean this to be //0b-1-3-1-3?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or 3-1-3-1.

Copy link
Contributor

@mcpowers mcpowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to have @kuksenko or @ericcaspole look at MLDSABench.java.


import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Field;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import statement

import java.lang.invoke.MethodHandles;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Constructor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import

long seed, int i) throws Exception, Throwable {
int[] coeffs3 = new int[ML_DSA_N];
for (int j = 0; j<ML_DSA_N; j++) {
coeffs3[j] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coeffs3 is written to but never read

int[] prod4 = new int[ML_DSA_N];
try {
for (int i = 0; i < repeat; i++) {
// seed = rnd.nextLong();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 lines commented out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was useful during development and might be useful hint for debugging; instead of deleting, added a comment. Let me know if that works

-554416, 3919660, -48306, -1362209, 3937738, 1400424, -846154, 1976782
};
}
// java --add-opens java.base/sun.security.provider=ALL-UNNAMED -XX:+UseDilithiumIntrinsics test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is line is useful. Not sure I would hide it at the bottom of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually meant to delete it, but will move it to the top.

@@ -0,0 +1,421 @@
/*
* Copyright (c) 2015, 2018, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was some copy-paste! Thanks

Copy link
Member

@jatin-bhateja jatin-bhateja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor initial comments

@@ -0,0 +1,517 @@
/*
* Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.

Comment on lines +141 to +145

for (int j = 0; j<ML_DSA_N; j++) {
coeffs1[j] = rnd.nextInt();
coeffs2[j] = rnd.nextInt();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can uses generators for randome initialization of array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant this?

        coeffs1 = rnd.ints(ML_DSA_N).toArray();
        coeffs2 = rnd.ints(ML_DSA_N).toArray();

Didn't know about this, thanks. It does work..

But the original purpose (perhaps misguided, but its done) was to 'factor out' the allocations; the outer loop runs many million times (I've left it running for 6+hours during development) and so I wanted a 'somewhat efficient' test.

In hindsight, these (1k) arrays could probably be stack allocated, but I did not want to depend on an optimization when I could just write it without allocations in the mainline

long seed = rnd.nextLong();
rnd.setSeed(seed);
//Note: it might be useful to increase this number during development of new intrinsics
final int repeat = 10000000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of high repetition count can you try tuning the tiered compilation threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the test is to test various (pseudo-random) values and compare the results to the java implementation of same code. A single run-though of the test doesn't always prove that there are no bugs.

A bit philosophical.. as is well known, when writing crypto, branches (conditional on secret) are disallowed; but e.g. carry propagation has the same 'conditional execution' effect. (Instead of "have you tested every branch direction" its "have you tested every carry") Besides a very careful range/overflow analysis (which I also did.. ntt functions skate very close to the int limit), exhaustive fuzz testing is the best method to find conditions that manual (range/overflow) analysis hasn't found; fuzz testing has very little math built in, so its also good at finding 'blind spots' I (and whomever has to review) might have not thought of..

assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
(vector_len == AVX_256bit ? VM_Version::supports_avx2() :
(vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)

But your assertions are nowhere checking this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe those asserts are in vex_prefix_and_encode (

assert(((!is_extended) || (!attributes->is_legacy_mode())),"XMM register should be 0-15");
) and vex_prefix (
assert((!is_extended || (!attributes->is_legacy_mode())),"XMM register should be 0-15");
)

I also haven't found any other instruction that does this check so I could emulate the style.

assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
(vector_len == AVX_256bit ? VM_Version::supports_avx2() :
(vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)

But your assertions are nowhere checking this.

}

void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
assert(VM_Version::supports_evex(), "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(VM_Version::supports_evex(), "");
assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), "");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took the patch, but also kept the supports_evex() assert

}

void Assembler::evmovshdup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
assert(VM_Version::supports_evex(), "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

MacroAssembler *_masm) {

static address generate_dilithiumAlmostNtt_avx(StubGenerator *stubgen,
int vector_len, MacroAssembler *_masm) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation corretness

@openjdk openjdk bot added rfr Pull request is ready for review and removed rfr Pull request is ready for review labels Nov 17, 2025
@kuksenko
Copy link
Contributor

kuksenko commented Nov 19, 2025

What is the reason to add a new microbenchmark?
We already have enough micros covering MLDSA:

org.openjdk.bench.javax.crypto.full.KeyPairGeneratorBench.MLDSA.generateKeyPair
org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.sign
org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.verify
org.openjdk.bench.javax.crypto.small.KeyPairGeneratorBench.MLDSA.generateKeyPair
org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.sign
org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.verify

@vpaprotsk
Copy link
Contributor Author

What is the reason to add a new microbenchmark? We already have enough micros covering MLDSA:

org.openjdk.bench.javax.crypto.full.KeyPairGeneratorBench.MLDSA.generateKeyPair org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.sign org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.verify org.openjdk.bench.javax.crypto.small.KeyPairGeneratorBench.MLDSA.generateKeyPair org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.sign org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.verify

I can definitely remove it, got no strong attachment to it.. I did find it useful during development and thought it might be useful during review to verify performance.. but the usefulness of it beyond is indeed debatable.

You might notice its a lot more 'granular'; it measures the performance of the intrinsics themselves, not the ("10-level deep") "wrappers". That said, those "wrappers" is what actual user will see and what we should be measuring.

This new benchmark is only useful to another intrinsic developer.. (but it should already be usable by other platforms not just Intel?)

@kuksenko
Copy link
Contributor

kuksenko commented Nov 19, 2025

What is the reason to add a new microbenchmark? We already have enough micros covering MLDSA:
org.openjdk.bench.javax.crypto.full.KeyPairGeneratorBench.MLDSA.generateKeyPair org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.sign org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA.verify org.openjdk.bench.javax.crypto.small.KeyPairGeneratorBench.MLDSA.generateKeyPair org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.sign org.openjdk.bench.javax.crypto.small.SignatureBench.MLDSA.verify

I can definitely remove it, got no strong attachment to it.. I did find it useful during development and thought it might be useful during review to verify performance.. but the usefulness of it beyond is indeed debatable.

You might notice its a lot more 'granular'; it measures the performance of the intrinsics themselves, not the ("10-level deep") "wrappers". That said, those "wrappers" is what actual user will see and what we should be measuring.

This new benchmark is only useful to another intrinsic developer.. (but it should already be usable by other platforms not just Intel?)

I understand your reasons. The question is whether you'll need the microbenchmark in the future. If no (or probably no), please remove the micro.
If needed, please move it from the "org.openjdk.bench.javax.crypto.full" package to "org.openjdk.bench.javax.crypto". It is supposed to have only public API micros in packages "small" and "full"

// r1 in Quotient
// r1 = r1 & quotient; // copy 0 or keep as is, using EqMsk as filter
for (int i = 0; i < regCnt; i++) {
// FIXME: replace with void evmovdqul(Address dst, KRegister mask, XMMRegister src, bool merge, int vector_len);?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fixme a leftover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Removed. (I think I was considering merging this instruction with the storeXmm, but there really isnt a good way to do that)

Copy link

@sviswa7 sviswa7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@vpaprotsk
Copy link
Contributor Author

I understand your reasons. The question is whether you'll need the microbenchmark in the future. If no (or probably no), please remove the micro. If needed, please move it from the "org.openjdk.bench.javax.crypto.full" package to "org.openjdk.bench.javax.crypto". It is supposed to have only public API micros in packages "small" and "full"

@kuksenko I decided to just remove it. If anyone wants it back, its in my git history (I usually keep my branches after merge..)

@iwanowww
Copy link
Contributor

iwanowww commented Nov 20, 2025

If anyone wants it back, its in my git history (I usually keep my branches after merge..)

You could put a comment with the link into JBS issue to make it easier to discover later. (Or just attach the source file there.)

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 20, 2025
@vpaprotsk
Copy link
Contributor Author

@iwanowww thanks for the suggestion! attached to JBS.

@mcpowers would you mind running your internal test suite for this PR? I am thinking of integrating early next week, if no objections; getting close to the release deadline, dont want to cut it even closer..

@mcpowers
Copy link
Contributor

Always faster and never slower:

SignatureBench.MLDSA with +UseDilithiumIntrinsics shows an average 1.61% improvement across all algorithms and data sizes.
Measuring SignatureBench.MLDSA against a baseline build without the fix, shows an average 2.24% improvement across all algorithms and data sizes.

There's nothing special about my benchmark. It's the one in OpenJDK (javax.crypto.full.SignatureBench).

Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

@ferakocz
Copy link
Contributor

Good work! I just found a few typos in the comments.

// | B | | D | | ...
// +-----+-----+-----+-----+-----
//
// NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: respectivelly -> respectively

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// the odd numbered slots of a third register.
// 2. Swap the even and odd numbered slots of the original input registers.
// 3. Similar to step 1, but into a different output register.
// the odd numbered slots of a scratch2 register.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: scratch2 -> scratch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I meant "the scratch2" register here.. reworded, please double check if its clearer..

// 2. Swap the even and odd numbered slots of the original input registers.
// 3. Similar to step 1, but into a different output register.
// the odd numbered slots of a scratch2 register.
// 2. Swap the even and odd numbered slots of the original input registers.*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: unnecessary '*' at the end

// 3. Similar to step 1, but into a different output register.
// the odd numbered slots of a scratch2 register.
// 2. Swap the even and odd numbered slots of the original input registers.*
// 3. Similar to step 1, but into output register.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: into output register -> into an output register

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used 'the' to be 'specific'.. (I think the lack of articles was causing the confusion.. "the scratch2 register is combined with the output register into scratch.. or something..) Also reworded step 4?

// registers indexed by the numbers in inputRegs2 will contain the same number,
// this should be indicated by calling this function with
// input2NeedsShuffle=false .
// (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: unnecessary '(*' at the beginning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my attempt to add a note to second step.. spelled out "Note"? or can just remove, since swapping only happens on second step..

// input in two stages. First half, load 8 registers 32 integers each apart.
// With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
// Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
// 2-, 1-integer appart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appart -> apart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks like I've always misspelled that word! :)

// With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
// Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
// 2-, 1-integer appart)
// Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appart -> apart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
// Other levels, shuffles can be done by re-aranging register order

// Four batches of 8 registers each, 128 bytes appart
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appart -> apart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


for (int i = 0; i < 8; i++) {
__ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit);
// Four batches of 8 registers each, 128 bytes appart
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appart -> apart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// the substartion is (Montgomery) multiplied by the corresponding zetas.
// In each level we just collect the coefficients (using evpermi2d()
// instructions where necessary, i.e. on levels 0-4) so that the results of
// the substration is (Montgomery) multiplied by the corresponding zetas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substration -> subtraction (I know this was in my own comment :-( )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (funny, thats exactly how I say "substraction" in my head too :D )

@vpaprotsk
Copy link
Contributor Author

SignatureBench.MLDSA with +UseDilithiumIntrinsics shows an average 1.61% improvement across all algorithms and data sizes. Measuring SignatureBench.MLDSA against a baseline build without the fix, shows an average 2.24% improvement across all algorithms and data sizes.

Need bit of clarification.. (I think you are saying there is a regression?).

  • +UseDilithiumIntrinsics should be redundant (i.e. vm_version_x86.cpp should automatically detect and turn the feature on).
    • So if I read correctly.. the baseline measured is already has the original intrinsics (implicitly) enabled..
      • therefore there is a 2.24% noise in the benchmark?

In my measurements for AVX512 parts, I had seen between 0%->6% across SignatureBench.MLDSA
- (some variation on desktop-vs-server parts..)
- SignatureBench.MLDSA.verify was worse, only 0->2% depending on keysize (iirc, bigger portion of benchmark was in SHA3 instead)
- SignatureBench.MLDSA.sign was better, 4-6% (also depending on datasize)

That is also why I had included the other (deleted) microbenchmark.. SignatureBench.MLDSA has a lot of 'other things' (e.g. SHA3) also happening, so the AVX512 intrinsic changes were harder to differentiate from noise..
- I had measured ~25%-50% improvement on purely the 5 intrinsics changed..

Hence the claim 'never worse'.. A more precise claim..:
- "New intrinsics seem to be better, but (at least for AVX512) existing intrinsics were already plenty good for MLDSA"

@mcpowers
Copy link
Contributor

The 2.24% improvement is the difference between +UseDilithiumIntrinsics and -UseDilithiumIntrinsics. I just repeated the testing that you documented in the description section of this PR on a different machine.

My baseline is simply a build without your changes. I compared this with a build containing your changes and see a 2.24% improvement.

Verification showed the least amount of improvement (same as what you observed).

"never worse" is just my way of saying "always faster".

Copy link
Contributor Author

@vpaprotsk vpaprotsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcpowers Thanks for tests!

@ferakocz thanks for the review! I think I took them all in, except for the montMul comment section.. Not quite what I meant so tried to reword.. see if it helps any?

// | B | | D | | ...
// +-----+-----+-----+-----+-----
//
// NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// the odd numbered slots of a third register.
// 2. Swap the even and odd numbered slots of the original input registers.
// 3. Similar to step 1, but into a different output register.
// the odd numbered slots of a scratch2 register.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I meant "the scratch2" register here.. reworded, please double check if its clearer..

// registers indexed by the numbers in inputRegs2 will contain the same number,
// this should be indicated by calling this function with
// input2NeedsShuffle=false .
// (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my attempt to add a note to second step.. spelled out "Note"? or can just remove, since swapping only happens on second step..

// 3. Similar to step 1, but into a different output register.
// the odd numbered slots of a scratch2 register.
// 2. Swap the even and odd numbered slots of the original input registers.*
// 3. Similar to step 1, but into output register.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used 'the' to be 'specific'.. (I think the lack of articles was causing the confusion.. "the scratch2 register is combined with the output register into scratch.. or something..) Also reworded step 4?

// If so, use output:
const XMMRegister* scratch = scratch1 == input1 ? output: scratch1;

// scratch = input1_even*intput2_even
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
// Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
// 2-, 1-integer appart)
// Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
// 2-, 1-integer appart)
// Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
// Other levels, shuffles can be done by re-aranging register order
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
// Other levels, shuffles can be done by re-aranging register order

// Four batches of 8 registers each, 128 bytes appart
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// the substartion is (Montgomery) multiplied by the corresponding zetas.
// In each level we just collect the coefficients (using evpermi2d()
// instructions where necessary, i.e. on levels 0-4) so that the results of
// the substration is (Montgomery) multiplied by the corresponding zetas.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (funny, thats exactly how I say "substraction" in my head too :D )


for (int i = 0; i < 8; i++) {
__ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit);
// Four batches of 8 registers each, 128 bytes appart
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Nov 24, 2025
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 24, 2025
Copy link
Member

@jatin-bhateja jatin-bhateja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work @vpaprotsk ,
Please also add in comments the links to original reference implimentation of stub.

Assembler::AVX_512bit, scratch); // 2^64 mod q
vector_len, scratch); // 2^64 mod q
if (vector_len == Assembler::AVX_512bit) {
__ mov64(scratch, 0b0101010101010101);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add long constant suffix

} else {
__ evpbroadcastd(constant, rConstant, Assembler::AVX_512bit); // constant multiplier

__ mov64(scratch, 0b0101010101010101); //dw-mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant suffix

}
}
static void loadXmms(const XMMRegister destinationRegs[], Register source, int offset,
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {

__ evmovdqul(Address(destination, offset + i * XMMBYTES), xmm(xmmRegs[i]),
Assembler::AVX_512bit);
static void storeXmms(Register destination, int offset, const XMMRegister xmmRegs[],
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {


// zetas (int[128*8]) = c_rarg1
static address generate_dilithiumAlmostInverseNtt_avx(StubGenerator *stubgen,
int vector_len,MacroAssembler *_masm) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix indentation

static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen,
MacroAssembler *_masm) {
static address generate_dilithiumNttMult_avx(StubGenerator *stubgen,
int vector_len, MacroAssembler *_masm) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix indentation

const XMMRegister Coeffs4_2[] = {xmm5, xmm21, xmm7, xmm23};

// Constants for shuffle and montMul64
__ mov64(scratch, 0b1010101010101010);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 bit constant suffix

static address generate_dilithiumMontMulByConstant_avx512(StubGenerator *stubgen,
MacroAssembler *_masm) {
static address generate_dilithiumMontMulByConstant_avx(StubGenerator *stubgen,
int vector_len, MacroAssembler *_masm) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix indentation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org ready Pull request is ready to be integrated rfr Pull request is ready for review security security-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

9 participants