-
Notifications
You must be signed in to change notification settings - Fork 10
Description
The current SimHashIterator
works by finding the most significant active bit, and then calculating the number of SimHash
buckets that will exist from that bit to the end of the filter. This means we are iterating (most_significant_bit / SIM_BUCKET_SIZE) + SIM_BUCKETS
buckets.
The problem here is that in the higher order bits the sparsity is very high, meaning that there will potentially be many buckets where the values we're mixing into our SimHash
es will just be zeros, and given any value XOR 0
is just that value this is wasted computation.
We could move to use the BitChunks
iterators which batch up our chunks into a u64
and an index of that block of bits, skipping ranges full of zeros. This will introduce issues where if our SimHash
bucket straddles multiple BitChunks
, which can be solved by peeking the next chunk, making sure that the index is contiguous with our current chunk.
Right now SIM_BUCKET_SIZE
is always 6
bits but if we allowed fully arbitrary SimHash
bucket sizes this could get slightly gnarly because we'd need to be able to peek an arbitrary number of BitChunk
s ahead of the current position.
The following is a line chart plot of the proportion of empty bit_range
s mixed into the SimHashes using pseudorandomly generated geofilters with 0..100_000
items added to them.

The ratio reduces very sharply after a couple of inserts but remains relatively high since newly added items have some chance of landing in a very significant bucket which introduces a load of zero gaps before the next one bit.