- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.2k
Implement off-heap quantized scoring #14863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. | 
| I ran some benchmarks on Cohere vectors (768d) for 7-bit and 4-bit (compressed) quantization.. 
 
 This PR without  This PR with  I did see slight fluctuation across runs, but the search time was ~10% faster for 7-bit and very slightly faster for 4-bit (compressed). Indexing and force merge times have improved by ~15% | 
| FYI I observed a strange phenomenon where if the query vector is on heap like: this.query = MemorySegment.ofArray(targetBytes);instead of the current off-heap implementation in this PR: this.query = Arena.ofAuto().allocateFrom(JAVA_BYTE, targetBytes);..then we see a performance regression: Maybe I'm missing something obvious, but I haven't found the root cause yet.. | 
| 
 yeah. I've seen similar before. You might be hitting a problem with the loop bound not being hoisted. I will try to take a look. | 
| Thanks @ChrisHegarty! I saw that we use a heap-backed  | 
| This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! | 
| After this conversation I re-ran some benchmarks with  
 
 This PR: 
 Looks like the changes in this PR have a small benefit on the search side, but slow down indexing by a lot.. | 
82e1faa    to
    0167f13      
    Compare
  
    | We had some interesting findings in #14874, so I updated this PR to reflect those.. Benchmarks for 768d Cohere vectors (dot product similarity, 4-bit one is compressed): 
 
 This PR with  This PR without  We see some speedup in both indexing and search for 7-bit compression, not so much for the 4-bit one I wrote specific functions in  The drawback here is that we'll need specific functions for comparing "packed" versions of the query / documents for other similarity functions like "euclidean" as well -- looking for inputs on whether the gains justify maintaining those functions.. Also tagging people who might be interested in these changes, maybe @benwtrent (since you were exploring something similar in #13497) or @ChrisHegarty? | 
        
          
                lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...g/apache/lucene/internal/vectorization/Lucene99MemorySegmentScalarQuantizedVectorScorer.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | Thanks for the review @thecoop! I've tried to address your comments above, do let me know if you have more feedback.. | 
| This looks like a compelling optimization?  And the "search in the same JVM that did indexing" issue is separately already resolved?  Since this is such a risk with Java (hotspot head-fake vulnerability), maybe  Were there other pending concerns? I cannot really tell from the discussion... | 
| 
 Yes, that was fixed in #14874 
 One consequence of this change: since we don't want to copy 4-bit values to the Java heap, we need to define functions in  The question I had was whether the benefits are compelling enough to maintain these functions.. | 
| 
 I think the goal for 4bit is that we just have the "compressed" version only. | 
6ce7f50    to
    ac04f41      
    Compare
  
    | Sorry for the delay here! I ran the following benchmarks on 768d Cohere vectors for all vector similarities, with 4-bit (compressed) and 7-bit quantization. I needed to run 10k queries for reliable results (saw some variance in the default case of 1k queries) 
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks great. One question on tests.
I think we will get even nicer performance through adding bulk scoring methods :). Though that can be in a later PR I think.
| b[i] = (byte) random().nextInt(16); | ||
| } | ||
|  | ||
| assertIntReturningProviders(p -> p.int4DotProduct(a, false, pack(b), true)); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add some more tests here to cover:
- cosine
- square distance
And their both packed, both unpacked versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cosine does not have a separate function in PanamaVectorUtilSupport (it re-uses dot product) -- but I've added tests for square distance (and their packed versions)
- Also add tests, fix failures
# Conflicts: # lucene/CHANGES.txt
| Thanks @benwtrent ! I realized that this PR would give incorrect results if 8-bit quantization was used (added recently in #15148) -- because it used the  I switched them over to  I'll try to run a test with 8-bit quantization too, I realized this PR will implicitly support it :) | 
# Conflicts: # lucene/CHANGES.txt
| 
 Here are the benchmarks: 
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Thank you!
Square distance improvements are interesting! I was expecting this for int4 as we didn't really have optimized paths there. But wow!
| Thanks @benwtrent :) If there's no other comments, could someone help with merging? | 
Off-heap scoring for quantized vectors! Related to #13515 This scorer is in-line with [`Lucene99MemorySegmentFlatVectorsScorer`](https://github.com/apache/lucene/blob/77f0d1f6d6762ca6ac9af5acc0c950365050d939/lucene/core/src/java24/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentFlatVectorsScorer.java#L30), and will automatically be used with [`PanamaVectorizationProvider`](https://github.com/apache/lucene/blob/77f0d1f6d6762ca6ac9af5acc0c950365050d939/lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L30C13-L30C40) (i.e. on adding `jdk.incubator.vector`). Note that the computations are already vectorized, but we're avoiding the unnecessary copy to heap here.. I added off-heap Dot Product functions for two compressed 4-bit ints (i.e. no need to "decompress" them) -- I can try to come up with similar ones for Euclidean if this approach seems fine..
| Thank you @benwtrent ! | 
| @kaivalnp strangely, this added some indexing regression: https://benchmarks.mikemccandless.com/2025.09.16.18.04.08.html I would expect things to be pretty much the same :/ I haven't dug into it yet. | 
| 
 I was just about to post this :) | 
| 
 @benwtrent I opened #15272 | 
Description
Off-heap scoring for quantized vectors! Related to #13515
This scorer is in-line with
Lucene99MemorySegmentFlatVectorsScorer, and will automatically be used withPanamaVectorizationProvider(i.e. on addingjdk.incubator.vector). Note that the computations are already vectorized, but we're avoiding the unnecessary copy to heap here..I added off-heap Dot Product functions for two compressed 4-bit ints (i.e. no need to "decompress" them) -- I can try to come up with similar ones for Euclidean if this approach seems fine..