@kaivalnp kaivalnp commented Oct 2, 2025

Addresses #15284

- Refactor internal classes
- Add missing javadocs
- Remove unused functions

kaivalnp commented Oct 6, 2025

VectorUtilBenchmark results:

main

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar                            1024  thrpt   15   0.841 ± 0.001  ops/us
VectorUtilBenchmark.binaryCosineVector                            1024  thrpt   15   4.778 ± 0.012  ops/us
VectorUtilBenchmark.binaryDotProductScalar                        1024  thrpt   15   2.289 ± 0.012  ops/us
VectorUtilBenchmark.binaryDotProductUint8Scalar                   1024  thrpt   15   2.307 ± 0.010  ops/us
VectorUtilBenchmark.binaryDotProductUint8Vector                   1024  thrpt   15   8.040 ± 0.001  ops/us
VectorUtilBenchmark.binaryDotProductVector                        1024  thrpt   15   8.040 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar      1024  thrpt   15   2.368 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  11.652 ± 0.104  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar                1024  thrpt   15   2.378 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.446 ± 0.009  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   2.627 ± 0.013  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.677 ± 0.160  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar          1024  thrpt   15   1.642 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  12.614 ± 0.010  ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar                    1024  thrpt   15   2.465 ± 0.006  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.022 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   2.590 ± 0.012  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.526 ± 0.012  ops/us
VectorUtilBenchmark.binarySquareScalar                            1024  thrpt   15   2.431 ± 0.007  ops/us
VectorUtilBenchmark.binarySquareUint8Scalar                       1024  thrpt   15   2.422 ± 0.025  ops/us
VectorUtilBenchmark.binarySquareUint8Vector                       1024  thrpt   15   6.709 ± 0.002  ops/us
VectorUtilBenchmark.binarySquareVector                            1024  thrpt   15   6.710 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineScalar                             1024  thrpt   15   1.419 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector                             1024  thrpt   75   8.913 ± 0.013  ops/us
VectorUtilBenchmark.floatDotProductScalar                         1024  thrpt   15   3.734 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVector                         1024  thrpt   75  12.561 ± 0.346  ops/us
VectorUtilBenchmark.floatSquareScalar                             1024  thrpt   15   3.181 ± 0.013  ops/us
VectorUtilBenchmark.floatSquareVector                             1024  thrpt   75  12.370 ± 0.398  ops/us
VectorUtilBenchmark.l2Normalize                                   1024  thrpt   15   3.016 ± 0.002  ops/us
VectorUtilBenchmark.l2NormalizeVector                             1024  thrpt   75  12.349 ± 0.719  ops/us

This PR

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar                            1024  thrpt   15   0.841 ± 0.001  ops/us
VectorUtilBenchmark.binaryCosineVector                            1024  thrpt   15   4.860 ± 0.007  ops/us
VectorUtilBenchmark.binaryDotProductScalar                        1024  thrpt   15   2.298 ± 0.014  ops/us
VectorUtilBenchmark.binaryDotProductUint8Scalar                   1024  thrpt   15   2.288 ± 0.024  ops/us
VectorUtilBenchmark.binaryDotProductUint8Vector                   1024  thrpt   15   8.040 ± 0.001  ops/us
VectorUtilBenchmark.binaryDotProductVector                        1024  thrpt   15   8.039 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar      1024  thrpt   15   2.376 ± 0.003  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  11.498 ± 0.286  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar                1024  thrpt   15   2.376 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.449 ± 0.007  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   2.627 ± 0.009  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.785 ± 0.009  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar          1024  thrpt   15   1.696 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  12.562 ± 0.023  ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar                    1024  thrpt   15   2.474 ± 0.010  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.021 ± 0.006  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   2.609 ± 0.015  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.487 ± 0.075  ops/us
VectorUtilBenchmark.binarySquareScalar                            1024  thrpt   15   2.413 ± 0.021  ops/us
VectorUtilBenchmark.binarySquareUint8Scalar                       1024  thrpt   15   2.420 ± 0.017  ops/us
VectorUtilBenchmark.binarySquareUint8Vector                       1024  thrpt   15   6.709 ± 0.002  ops/us
VectorUtilBenchmark.binarySquareVector                            1024  thrpt   15   6.709 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineScalar                             1024  thrpt   15   1.415 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector                             1024  thrpt   75   8.646 ± 0.080  ops/us
VectorUtilBenchmark.floatDotProductScalar                         1024  thrpt   15   3.733 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector                         1024  thrpt   75  12.249 ± 0.046  ops/us
VectorUtilBenchmark.floatSquareScalar                             1024  thrpt   15   3.171 ± 0.008  ops/us
VectorUtilBenchmark.floatSquareVector                             1024  thrpt   75  12.483 ± 0.104  ops/us
VectorUtilBenchmark.l2Normalize                                   1024  thrpt   15   3.017 ± 0.002  ops/us
VectorUtilBenchmark.l2NormalizeVector                             1024  thrpt   75  12.207 ± 0.764  ops/us
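
One rough way to read the two runs above is to check whether each benchmark's `Score ± Error` ranges overlap. The sketch below assumes the JMH text report has been pasted into strings; `parse_jmh` and `intervals_overlap` are hypothetical helper names (not part of JMH or Lucene), and range overlap is only a noise heuristic, not a significance test.

```python
import re

# One row of the JMH text report: name, (size), mode, count, score, "±", error, units.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<size>\d+)\s+(?P<mode>\w+)\s+(?P<cnt>\d+)\s+"
    r"(?P<score>[\d.]+)\s+±\s+(?P<err>[\d.]+)\s+(?P<units>\S+)$"
)

def parse_jmh(text):
    """Return {benchmark name: (score, error)} for every line matching ROW."""
    results = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # header and blank lines simply do not match
            results[m["name"]] = (float(m["score"]), float(m["err"]))
    return results

def intervals_overlap(a, b):
    """True if the score ± error ranges of two runs intersect."""
    (sa, ea), (sb, eb) = a, b
    return sa - ea <= sb + eb and sb - eb <= sa + ea

# Two rows from each run, copied from the tables above.
main_run = parse_jmh("""
VectorUtilBenchmark.floatCosineVector      1024  thrpt   75   8.913 ± 0.013  ops/us
VectorUtilBenchmark.floatSquareVector      1024  thrpt   75  12.370 ± 0.398  ops/us
""")
pr_run = parse_jmh("""
VectorUtilBenchmark.floatCosineVector      1024  thrpt   75   8.646 ± 0.080  ops/us
VectorUtilBenchmark.floatSquareVector      1024  thrpt   75  12.483 ± 0.104  ops/us
""")

for name in sorted(main_run.keys() & pr_run.keys()):
    verdict = "overlap" if intervals_overlap(main_run[name], pr_run[name]) else "distinct"
    print(f"{name}: {verdict}")
```

For these four rows it reports `floatCosineVector` as distinct and `floatSquareVector` as overlapping.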

kaivalnp commented Oct 6, 2025

Ran some luceneutil benchmarks on Cohere vectors, 768d for various vector similarities x quantization bits:

dot_product

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.641        0.675   0.666        0.987  200000   100      50       32        250     1 bits     5101     10.74      18627.18           20.85             1          624.45       606.918       20.981       HNSW
 0.878        1.170   1.161        0.992  200000   100      50       32        250     4 bits     4662     12.20      16398.82           23.07             1          678.09       662.231       76.294       HNSW
 0.915        1.517   1.505        0.992  200000   100      50       32        250     7 bits     4605     12.58      15896.99           31.01             1          751.27       735.474      149.536       HNSW
 0.915        1.523   1.515        0.995  200000   100      50       32        250     8 bits     4570     11.64      17180.65           18.18             1          751.17       735.474      149.536       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.641        0.678   0.668        0.985  200000   100      50       32        250     1 bits     5064     10.83      18467.22           21.32             1          624.43       606.918       20.981       HNSW
 0.876        1.140   1.131        0.992  200000   100      50       32        250     4 bits     4660     11.67      17132.09           23.35             1          678.10       662.231       76.294       HNSW
 0.914        1.514   1.504        0.993  200000   100      50       32        250     7 bits     4575     12.34      16208.77           18.19             1          751.21       735.474      149.536       HNSW
 0.916        1.576   1.566        0.994  200000   100      50       32        250     8 bits     4580     12.32      16229.81           18.29             1          751.23       735.474      149.536       HNSW

mip

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.640        0.754   0.745        0.988  200000   100      50       32        250     1 bits     5076     11.12      17987.23           20.55             1          624.43       606.918       20.981       HNSW
 0.877        1.174   1.165        0.992  200000   100      50       32        250     4 bits     4645     11.95      16737.80           24.10             1          678.11       662.231       76.294       HNSW
 0.912        1.566   1.557        0.994  200000   100      50       32        250     7 bits     4573     11.96      16723.81           18.21             1          751.21       735.474      149.536       HNSW
 0.916        1.509   1.500        0.994  200000   100      50       32        250     8 bits     4578     12.18      16416.32           18.29             1          751.19       735.474      149.536       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.641        0.709   0.700        0.987  200000   100      50       32        250     1 bits     5080     11.68      17120.36           20.85             1          624.44       606.918       20.981       HNSW
 0.877        1.191   1.182        0.992  200000   100      50       32        250     4 bits     4654     11.61      17232.47           22.12             1          678.11       662.231       76.294       HNSW
 0.914        1.527   1.518        0.994  200000   100      50       32        250     7 bits     4585     12.27      16306.56           18.17             1          751.22       735.474      149.536       HNSW
 0.915        1.541   1.532        0.994  200000   100      50       32        250     8 bits     4582     11.70      17091.10           18.30             1          751.22       735.474      149.536       HNSW

euclidean

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.691        0.625   0.615        0.984  200000   100      50       32        250     1 bits     4723      9.64      20751.19           17.36             1          615.12       606.918       20.981       HNSW
 0.906        0.993   0.979        0.986  200000   100      50       32        250     4 bits     4413     10.70      18698.58           21.10             1          669.73       662.231       76.294       HNSW
 0.948        1.361   1.353        0.994  200000   100      50       32        250     7 bits     4389     12.22      16369.29           25.86             1          743.24       735.474      149.536       HNSW
 0.950        1.335   1.326        0.993  200000   100      50       32        250     8 bits     4387     11.31      17691.29           25.83             1          743.26       735.474      149.536       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.692        0.628   0.618        0.984  200000   100      50       32        250     1 bits     4741     10.19      19627.09           17.71             1          615.11       606.918       20.981       HNSW
 0.905        0.987   0.977        0.990  200000   100      50       32        250     4 bits     4416     10.46      19118.63           20.92             1          669.72       662.231       76.294       HNSW
 0.949        1.396   1.387        0.994  200000   100      50       32        250     7 bits     4395     12.06      16579.62           25.65             1          743.22       735.474      149.536       HNSW
 0.951        1.332   1.316        0.988  200000   100      50       32        250     8 bits     4382     12.03      16629.25           25.74             1          743.24       735.474      149.536       HNSW

cosine

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.656        0.641   0.632        0.986  200000   100      50       32        250     1 bits     4996     10.17      19663.75           17.60             1          616.88       606.918       20.981       HNSW
 0.889        1.078   1.069        0.992  200000   100      50       32        250     4 bits     4603     10.64      18793.46           23.01             1          671.76       662.231       76.294       HNSW
 0.944        1.438   1.429        0.994  200000   100      50       32        250     7 bits     4537     12.14      16477.18           27.64             1          745.81       735.474      149.536       HNSW
 0.948        1.459   1.450        0.994  200000   100      50       32        250     8 bits     4524     11.83      16913.32           27.53             1          745.93       735.474      149.536       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.657        0.644   0.635        0.986  200000   100      50       32        250     1 bits     5006     10.30      19411.82           17.96             1          616.85       606.918       20.981       HNSW
 0.888        0.994   0.985        0.991  200000   100      50       32        250     4 bits     4565     11.39      17556.18           22.29             1          671.74       662.231       76.294       HNSW
 0.945        1.422   1.413        0.994  200000   100      50       32        250     7 bits     4522     11.72      17064.85           27.42             1          745.81       735.474      149.536       HNSW
 0.948        1.442   1.433        0.994  200000   100      50       32        250     8 bits     4514     11.94      16746.21           26.94             1          745.94       735.474      149.536       HNSW

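Except for one outlier (dot_product, 7 bits, force_merge(s): 31.01 on main vs. 18.19 with this PR), most values are within ~5% of each other; a handful of latency, index(s), and force_merge(s) entries differ by roughly 6-8%.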
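
A check like this can also be scripted rather than eyeballed. The sketch below is illustrative only: it hard-codes the latency(ms) and force_merge(s) columns of the dot_product tables above, and `rel_diff` and `outliers` are hypothetical helpers, not part of luceneutil.

```python
# (latency_ms, force_merge_s) copied from the dot_product tables, keyed by quantization bits.
main_run = {1: (0.675, 20.85), 4: (1.170, 23.07), 7: (1.517, 31.01), 8: (1.523, 18.18)}
pr_run = {1: (0.678, 21.32), 4: (1.140, 23.35), 7: (1.514, 18.19), 8: (1.576, 18.29)}
METRICS = ("latency(ms)", "force_merge(s)")

def rel_diff(baseline, candidate):
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0

def outliers(baseline_rows, candidate_rows, threshold_pct=5.0):
    """List (bits, metric, pct) for every metric that moved by more than threshold_pct."""
    flagged = []
    for bits in sorted(baseline_rows):
        for metric, base, cand in zip(METRICS, baseline_rows[bits], candidate_rows[bits]):
            pct = rel_diff(base, cand)
            if abs(pct) > threshold_pct:
                flagged.append((bits, metric, round(pct, 1)))
    return flagged

print(outliers(main_run, pr_run))
# → [(7, 'force_merge(s)', -41.3)]
```

Only the 7-bit force_merge(s) entry exceeds the 5% threshold for these two columns.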

# Conflicts:
#	lucene/core/src/java25/org/apache/lucene/internal/vectorization/VectorizedVectorUtilSupport.java
@kaivalnp kaivalnp marked this pull request as ready for review October 8, 2025 13:28

 /** A vectorization provider that leverages the Panama Vector API. */
-final class PanamaVectorizationProvider extends VectorizationProvider {
+final class VectorizedVectorizationProvider extends VectorizationProvider {
Contributor

Can we just keep the class name the same? The Panama name is correct here. Please don't change it.

Contributor

Same for the other classes. Everything that uses incubating APIs should keep "Panama" in its name (as it is called "Panama Vectorization" in the JEP).

Contributor Author

Sure, I don't have strong opinions on this. Changed back to the original Panama* names :)

@uschindler
Contributor

I am not able to do any close review here, so please don't merge this now.

@uschindler uschindler marked this pull request as draft October 12, 2025 01:03

mikemccand commented Oct 13, 2025

This PR

Maybe we could enhance Lucene's jmh infra so it can compare baseline/candidate runs somehow? It's hard for human eyes + brain to scan all those numbers and confirm there's no real difference... maybe open spinoff issue?

Edit: heh, and some comment about luceneutil's knnPerfTest.py? That tool has really flowered over time (and is now run in nightly benchmarks too) for testing all the many KNN options Lucene offers...

@kaivalnp
Contributor Author

> It's hard for human eyes + brain to scan all those numbers and confirm there's no real difference

Haha true :)
I fed the raw data to an LLM and asked it to report percentage differences:

| Benchmark | Baseline Score (ops/μs) | Candidate Score (ops/μs) | % Difference |
| --- | --- | --- | --- |
| floatCosineVector | 8.913 | 8.646 | -3.00% |
| floatDotProductVector | 12.561 | 12.249 | -2.48% |
| binaryHalfByteDotProductBothPackedVector | 11.652 | 11.498 | -1.32% |
| l2NormalizeVector | 12.349 | 12.207 | -1.15% |
| binaryDotProductUint8Scalar | 2.307 | 2.288 | -0.82% |
| binarySquareScalar | 2.431 | 2.413 | -0.74% |
| binaryHalfByteSquareBothPackedVector | 12.614 | 12.562 | -0.41% |
| floatSquareScalar | 3.181 | 3.171 | -0.31% |
| floatCosineScalar | 1.419 | 1.415 | -0.28% |
| binaryHalfByteSquareVector | 18.526 | 18.487 | -0.21% |
| binaryHalfByteDotProductScalar | 2.378 | 2.376 | -0.08% |
| binarySquareUint8Scalar | 2.422 | 2.420 | -0.08% |
| binaryHalfByteSquareSinglePackedScalar | 2.022 | 2.021 | -0.05% |
| floatDotProductScalar | 3.734 | 3.733 | -0.03% |
| binaryDotProductVector | 8.040 | 8.039 | -0.01% |
| binarySquareVector | 6.710 | 6.709 | -0.01% |
| binaryCosineScalar | 0.841 | 0.841 | 0.00% |
| binaryDotProductUint8Vector | 8.040 | 8.040 | 0.00% |
| binaryHalfByteDotProductSinglePackedVector | 2.627 | 2.627 | 0.00% |
| binarySquareUint8Vector | 6.709 | 6.709 | 0.00% |
| l2Normalize | 3.016 | 3.017 | 0.03% |
| binaryHalfByteDotProductSinglePackedScalar | 2.446 | 2.449 | 0.12% |
| binaryHalfByteDotProductBothPackedScalar | 2.368 | 2.376 | 0.34% |
| binaryHalfByteSquareScalar | 2.465 | 2.474 | 0.37% |
| binaryDotProductScalar | 2.289 | 2.298 | 0.39% |
| binaryHalfByteDotProductVector | 20.677 | 20.785 | 0.52% |
| binaryHalfByteSquareSinglePackedVector | 2.590 | 2.609 | 0.73% |
| floatSquareVector | 12.370 | 12.483 | 0.91% |
| binaryCosineVector | 4.778 | 4.860 | 1.72% |
| binaryHalfByteSquareBothPackedScalar | 1.642 | 1.696 | 3.29% |

Side note: I found this cool visualizer (https://jmh.morethan.io), which takes the JSON output of JMH (add -rf json to the command line), and can compare multiple runs too!

For example, I re-ran a subset of functions and recorded their output in https://gist.github.com/kaivalnp/0424bd84326aebdecd10f8144fb46c73
Now we can visualize the results at: https://jmh.morethan.io/?gist=0424bd84326aebdecd10f8144fb46c73
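
The same comparison can be done deterministically from two `-rf json` result files. This is a minimal sketch, assuming each file is a JSON array whose entries carry `benchmark`, optional `params`, and `primaryMetric.score`; `load_scores` and `percent_diffs` are hypothetical helpers, and the inline JSON stands in for real result files (scores taken from the tables above, benchmark names shortened as in the text report).

```python
import json

def load_scores(jmh_json_text):
    """Map (benchmark, sorted params) -> primaryMetric score."""
    scores = {}
    for entry in json.loads(jmh_json_text):
        params = tuple(sorted(entry.get("params", {}).items()))
        scores[(entry["benchmark"], params)] = entry["primaryMetric"]["score"]
    return scores

def percent_diffs(baseline, candidate):
    """Return [(key, baseline, candidate, pct)] sorted by pct, for keys present in both runs."""
    rows = []
    for key in baseline.keys() & candidate.keys():
        b, c = baseline[key], candidate[key]
        rows.append((key, b, c, (c - b) / b * 100.0))
    return sorted(rows, key=lambda row: row[3])

def fake_result(name, score):
    """Build one result entry in the assumed JSON shape (illustration only)."""
    return {"benchmark": name, "params": {"size": "1024"},
            "primaryMetric": {"score": score, "scoreUnit": "ops/us"}}

baseline_json = json.dumps([
    fake_result("VectorUtilBenchmark.floatCosineVector", 8.913),
    fake_result("VectorUtilBenchmark.binaryCosineVector", 4.778),
])
candidate_json = json.dumps([
    fake_result("VectorUtilBenchmark.floatCosineVector", 8.646),
    fake_result("VectorUtilBenchmark.binaryCosineVector", 4.860),
])

rows = percent_diffs(load_scores(baseline_json), load_scores(candidate_json))
for (name, _params), b, c, pct in rows:
    print(f"{name} {b:.3f} {c:.3f} {pct:+.2f}%")
```

With real files, the two `json.dumps(...)` stand-ins would be replaced by the contents of the baseline and candidate result files.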

Also found this GH action that automatically runs and compares JMH output: https://github.com/benchmark-action/github-action-benchmark, might be interesting to add to Lucene!
