- Documentation for 'Dictionary' Abstract Data Structure
- Single File Plain Java Implementation
Extendible hashing is a dynamic hashing technique that adjusts its structure as the dataset grows or shrinks, avoiding the performance limitations of static hashing.
In static hashing, the size of the hash table is fixed. Once it becomes full, collisions increase, causing degraded performance. Extendible hashing solves this by expanding and splitting only when necessary, without a complete table rebuild.
- Dynamic Growth: Directory size doubles only when needed, keeping space usage efficient.
- Reduced Collisions: Bucket splits are localized; only the bucket that overflows is split.
- Better Distribution: Keys are distributed based on higher-order bits of their hash value, minimizing clustering.
- No Full Rehashing: Unlike static hashing, existing keys remain in place unless their bucket is split.
- Each bucket has:
- A local depth (number of bits used from the hash to determine placement in that bucket).
- A fixed capacity of 2 entries in this implementation.
- A bucket overflows when trying to insert a new key-value pair beyond its capacity.
- Buckets can be shared between multiple directory entries if their local depth is smaller than the global depth.
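The sketch below illustrates such a bucket. The class and field names (`Bucket`, `CAPACITY`, the parallel `keys`/`values` arrays) and the int-only keys are assumptions for illustration, not the identifiers from the actual implementation:

```java
// Illustrative bucket; names and int-only keys are assumptions for this sketch.
class Bucket {
    static final int CAPACITY = 2;   // fixed capacity of 2 entries
    int localDepth;                  // hash bits used to place keys in this bucket
    int size;                        // entries currently stored
    int[] keys = new int[CAPACITY];
    int[] values = new int[CAPACITY];

    boolean isFull() {
        return size == CAPACITY;     // inserting past this triggers a split
    }
}
```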
- The directory is an array of references (pointers) to buckets.
- Global depth = number of bits from the hash value used to index into the directory.
- When a bucket that needs to split already has a local depth equal to the global depth, the directory size doubles and the global depth increases by 1.
- Each directory index maps to a bucket based on the binary representation of the hash value.
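A matching sketch of the directory, under the same assumed names:

```java
// Illustrative directory; names are assumptions for this sketch.
class Directory {
    int globalDepth = 1;                             // bits used to index the directory
    Bucket[] buckets = new Bucket[1 << globalDepth]; // 2^globalDepth entries

    Directory() {
        for (int i = 0; i < buckets.length; i++) {
            Bucket b = new Bucket();
            b.localDepth = globalDepth;  // initial buckets use all directory bits
            buckets[i] = b;
        }
    }

    // The lowest globalDepth bits of the hash select the directory slot.
    int indexFor(int hashValue) {
        return hashValue & ((1 << globalDepth) - 1);
    }
}
```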
- The hash function generates a hash value from each key.
- Supports integers and strings:
- Integers: Used directly as the hash value; the directory index is then extracted by bit-masking (a power-of-two modulo).
- Strings: Converted into a numeric hash by summing character codes with positional multipliers.
- Mixing Higher Bits:
- Instead of relying only on lower bits (which often cause clustering), higher bits of the hash are used when directory depth increases.
- This ensures that when splitting, keys are evenly redistributed, greatly reducing collisions.
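A hash function along these lines might look as follows. The method name `hash` and the positional multiplier (31 here) are assumptions, not taken from the implementation:

```java
// Sketch: hashes Integer and String keys; the multiplier 31 is an assumption.
static int hash(Object key) {
    if (key instanceof Integer) {
        return (Integer) key;            // integers hash to their own value
    }
    if (key instanceof String) {
        String s = (String) key;
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * 31 + s.charAt(i);    // character codes with positional weights
        }
        return h;
    }
    throw new IllegalArgumentException("Unsupported key type");
}
```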
- Collisions are handled by splitting buckets rather than by linear probing or chaining.
- Higher bits from the hash are used when the directory depth increases, ensuring that newly split buckets receive distinct key ranges.
- This prevents repeated collisions that would occur if only lower bits were used.
There are two types of splits:
- Local split. Occurs when:
  - The bucket’s local depth < global depth.
- Only the overflowing bucket is split into two new buckets.
- The directory entries pointing to the old bucket are divided between the two new buckets.
- Local depth of both new buckets = old local depth + 1.
- Directory doubling (global split). Occurs when:
  - The bucket’s local depth == global depth.
- Steps:
  - Double the directory size (the global depth increases by 1).
  - Redistribute existing directory pointers.
  - Perform a local split on the overflowing bucket.
When a bucket overflows:
- Check if local depth == global depth.
  - If yes → double the directory size and increment the global depth.
- Create two new buckets, each with local depth = old local depth + 1.
- Redistribute entries from the old bucket into the two new buckets based on the new bit position from the hash.
- Update directory pointers so each entry points to the correct new bucket.
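Putting these steps together, a split routine might look like the sketch below. It reuses the hypothetical `Bucket`, `Directory`, and `hash` names from the earlier sketches; the actual single-file implementation may organize this differently:

```java
// Sketch of the overflow-handling split; names and layout are assumptions.
static void split(Directory dir, Bucket old) {
    // 1. If the bucket already uses every directory bit, double the directory.
    if (old.localDepth == dir.globalDepth) {
        int oldSize = dir.buckets.length;
        Bucket[] newDir = new Bucket[oldSize * 2];
        for (int i = 0; i < oldSize; i++) {
            newDir[i] = dir.buckets[i];
            newDir[i + oldSize] = dir.buckets[i];  // both halves alias old buckets
        }
        dir.buckets = newDir;
        dir.globalDepth++;
    }

    // 2. Create two new buckets with local depth = old local depth + 1.
    Bucket zero = new Bucket();
    Bucket one = new Bucket();
    zero.localDepth = one.localDepth = old.localDepth + 1;

    // 3. Redistribute entries by the next higher-order bit of their hash.
    for (int i = 0; i < old.size; i++) {
        int bit = (hash(old.keys[i]) >> old.localDepth) & 1;
        Bucket target = (bit == 0) ? zero : one;
        target.keys[target.size] = old.keys[i];
        target.values[target.size] = old.values[i];
        target.size++;
    }

    // 4. Repoint every directory slot that referenced the old bucket.
    for (int i = 0; i < dir.buckets.length; i++) {
        if (dir.buckets[i] == old) {
            dir.buckets[i] = (((i >> old.localDepth) & 1) == 0) ? zero : one;
        }
    }
}
```

After the split, the insert that triggered the overflow is retried; if every key lands in the same new bucket, the new bucket overflows again and the split repeats.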
```java
index = hashValue & ((1 << globalDepth) - 1);
```
- Purpose: Selects the lowest `globalDepth` bits of the hash value to determine which directory entry to use.
- How:
  - `(1 << globalDepth)` creates a bitmask with a 1 followed by `globalDepth` zeros.
  - Subtracting 1 turns it into a mask of `globalDepth` ones.
  - `&` with `hashValue` extracts only those bits.
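  - Example: with `globalDepth = 2` and `hashValue = 0b1011`, the mask is `0b11`, so `index = 0b1011 & 0b11 = 3`.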
```java
bit = (hashValue >> localDepth) & 1;
```
- Purpose: Checks the next higher-order bit beyond the current `localDepth` to decide whether an entry moves to the new bucket or stays in the old one.
- How:
  - `>> localDepth` shifts the hash value right by `localDepth` bits.
  - `& 1` isolates the lowest bit of the shifted value.
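  - Example: with `hashValue = 0b1011` and `localDepth = 2`, `(0b1011 >> 2) & 1 = 0`, so the entry stays in the bucket handling the 0 bit.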
```java
for (int i = 0; i < oldSize; i++) {
    newDir[i] = oldDir[i];
    newDir[i + oldSize] = oldDir[i];
}
```
- Purpose: Duplicates the directory pointers when `globalDepth` increases.
- How:
  - `oldSize` = current directory length.
  - The first half of `newDir` is a copy of `oldDir`.
  - The second half is also a copy of `oldDir` (pointing to the same buckets initially).
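  - Example: with `oldSize = 4`, slots `i` and `i + 4` reference the same bucket until a later split separates them.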
In summary, this extendible hashing design:
- Grows gracefully without massive rehashing.
- Minimizes collisions by using more bits when necessary.
- Memory-efficient by sharing buckets across multiple directory entries until a split is needed.