
Performance improvement in SimHashIterator #78

@itsibitzi


The current SimHashIterator works by finding the most significant active bit and then calculating the number of SimHash buckets that exist from that bit to the end of the filter. This means we iterate over (most_significant_bit / SIM_BUCKET_SIZE) + SIM_BUCKETS buckets.

The problem here is that the higher-order bits are very sparse, so there will potentially be many buckets where the value we mix into our SimHashes is just zero. Since any value XORed with 0 is unchanged, this is wasted computation.
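
Purely to illustrate the shape of that work, here is a minimal sketch of the dense iteration, assuming a hypothetical mix function and bit layout (not the crate's actual code):

```rust
const SIM_BUCKET_SIZE: usize = 6;

/// Hypothetical sketch: visit every SIM_BUCKET_SIZE-bit bucket from bit 0 up
/// to the most significant set bit and mix its value into the running
/// SimHash, even when the bucket is entirely zero -- since hash ^ 0 == hash,
/// that work changes nothing.
fn simhash_dense_sketch(bits: &[u64], most_significant_bit: usize, mut hash: u64) -> u64 {
    let num_buckets = most_significant_bit / SIM_BUCKET_SIZE + 1;
    for bucket in 0..num_buckets {
        let start = bucket * SIM_BUCKET_SIZE;
        let mut value = 0u64;
        for offset in 0..SIM_BUCKET_SIZE {
            let bit = start + offset;
            if (bits[bit / 64] >> (bit % 64)) & 1 == 1 {
                value |= 1u64 << offset;
            }
        }
        // In the sparse high-order region `value` is very often 0 here.
        hash ^= mix(value, bucket as u64);
    }
    hash
}

/// Stand-in for whatever mixing function the real iterator feeds.
fn mix(value: u64, bucket: u64) -> u64 {
    (value ^ bucket).wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31)
}
```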

We could move to the BitChunks iterators, which batch bits up into a u64 plus the index of that block of bits, skipping ranges that are entirely zero. This introduces an issue when a SimHash bucket straddles multiple BitChunks, which can be solved by peeking at the next chunk and checking that its index is contiguous with the current one.
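
A rough sketch of that skipping idea, using std's Peekable over a hypothetical (chunk_index, u64) iterator standing in for BitChunks (the mix function and constants are the same stand-ins as above, not the crate's API):

```rust
const SIM_BUCKET_SIZE: usize = 6;
const CHUNK_BITS: usize = 64;

/// Stand-in for BitChunks (assumed behaviour): yields (chunk_index, bits)
/// for non-zero 64-bit chunks only, skipping runs of zeros entirely.
fn non_zero_chunks(bits: &[u64]) -> impl Iterator<Item = (usize, u64)> + '_ {
    bits.iter().copied().enumerate().filter(|&(_, word)| word != 0)
}

/// Extract the bits of bucket [start, end) that fall inside the chunk whose
/// first bit is `chunk_start`, shifted into their position within the bucket.
fn extract(word: u64, chunk_start: usize, start: usize, end: usize) -> u64 {
    let lo = start.max(chunk_start);
    let hi = end.min(chunk_start + CHUNK_BITS);
    if lo >= hi {
        return 0;
    }
    let mask = (1u64 << (hi - lo)) - 1;
    ((word >> (lo - chunk_start)) & mask) << (lo - start)
}

/// Sketch of the skipping iteration: only buckets overlapping a non-zero
/// chunk are mixed in; a bucket that straddles two chunks is completed by
/// peeking at the next chunk and checking its index is contiguous.
fn simhash_skipping_sketch(bits: &[u64], mut hash: u64) -> u64 {
    let mut chunks = non_zero_chunks(bits).peekable();
    let mut next_bucket = 0usize; // first bucket not yet fully handled
    while let Some((idx, word)) = chunks.next() {
        let chunk_start = idx * CHUNK_BITS;
        let first_bucket = (chunk_start / SIM_BUCKET_SIZE).max(next_bucket);
        let last_bucket = (chunk_start + CHUNK_BITS - 1) / SIM_BUCKET_SIZE;
        for bucket in first_bucket..=last_bucket {
            let (start, end) = (bucket * SIM_BUCKET_SIZE, (bucket + 1) * SIM_BUCKET_SIZE);
            let mut value = extract(word, chunk_start, start, end);
            // The bucket spills past this chunk: pull the rest from the next
            // chunk, but only when that chunk is contiguous with this one
            // (otherwise the missing bits are zero anyway).
            if end > chunk_start + CHUNK_BITS {
                if let Some(&(next_idx, next_word)) = chunks.peek() {
                    if next_idx == idx + 1 {
                        value |= extract(next_word, next_idx * CHUNK_BITS, start, end);
                    }
                }
            }
            if value != 0 {
                hash ^= mix(value, bucket as u64);
            }
        }
        next_bucket = last_bucket + 1;
    }
    hash
}

/// Stand-in for whatever mixing function the real iterator feeds.
fn mix(value: u64, bucket: u64) -> u64 {
    (value ^ bucket).wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31)
}
```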

Right now SIM_BUCKET_SIZE is always 6 bits, but if we allowed fully arbitrary SimHash bucket sizes this could get slightly gnarly, because we'd need to be able to peek an arbitrary number of BitChunks ahead of the current position.
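
As a back-of-the-envelope check (my own arithmetic, not anything from the crate): a bucket of b bits can straddle at most 1 + ceil((b - 1) / 64) chunks, so with the current 6-bit buckets peeking a single chunk is always enough, while larger bucket sizes would need a deeper lookahead.

```rust
/// Worst case: the bucket starts on the last bit of a chunk, leaving
/// `bucket_size - 1` bits to spill over into the following chunks.
fn max_chunks_straddled(bucket_size: usize) -> usize {
    1 + (bucket_size - 1 + 63) / 64
}

fn main() {
    assert_eq!(max_chunks_straddled(6), 2); // current SIM_BUCKET_SIZE: one peek suffices
    assert_eq!(max_chunks_straddled(65), 2);
    assert_eq!(max_chunks_straddled(66), 3); // would need to peek two chunks ahead
}
```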

The following line chart plots the proportion of empty bit_ranges mixed into the SimHashes, using pseudorandomly generated geofilters with 0..100_000 items added to them.

[Line chart: proportion of empty bit_ranges mixed into the SimHashes vs. number of inserted items]

The ratio drops very sharply after a couple of inserts but remains relatively high, since each newly added item has some chance of landing in a very significant bucket, which introduces a long run of zeros before the next one bit.
