Skip to content

Heuristic for Choosing Index Spatial and Spatial Temporal Indices

Eric Robertson edited this page Dec 14, 2015 · 11 revisions

This page discussions heuristics for choosing indices. Along with discussions focused on secondary indexing, this discussion will result in solutions integrated into a cost based optimizer.

The first requirement is determining a low-cost (computative resource) solution for choosing between a spatial index and a spatial temporal index without using statistics. Each index has an assigned set of dimensions: latitude, longitude and time. Within a bin, time is bounded. For example, a default bin is one year. The time bounds depend on the index strategy configuration.

Heuristic without Statistics

Ideally, an an effective index strategy reduces the size and number of ranges of cells (cubes) to search, indexed along the Hilbert space filling curve. A query with a temporal constraint may represent an elongated hyper-rectangle, resulting an many small ranges along of the Hilbert curve. In this case, a spatial index combined with distributed filtering is an optimal. In this scenario, time is the least constraining dimension.

In general, the idea is to make a judgement by comparing the least constraining dimensions.

The number of cells per each integral value for a specific dimension d of index i (d is in {latitude, longitude, time}) is 2^bits(d,i)/(SFCmax(d) - SFCmin(d)).

For example, along the longitudinal axis, at 20 bits of precision and a range of 360 degrees, the number of cells per degree is roughly 2913. Multiplied by the query range yields the number of cells to be scanned by the query scanner. For example, a query over a range 46-48 degrees 2913*2 requires a search of 5826 cells.

CELLS(d,i) = QR(d,i) * 2^bits(d,i)/(SFCmax(d) - SFCmin(d))

The maximum of the total possible cells for each dimension yields the least constraining dimension for a query and a given index. Call this value TCmax(i). Using algebra, compute a an adjusted Query Range AQR(i) for a each dimension:

AQR(d,i) = TCmax(i) * (SFCmax(d) - SFCmin(d)) / 2^bits(d, i).

Using the adjusted dimension, recompute (conflation) the search cells. Applying some algebra, the adjusted cell amount is TCMax(i).

ACELLS(d,i) = AQR(d,i) * 2^bits(d,i)/(SFCmax(d) - SFCmin(d))
          = TCmax(i) * (SFCmax(d) - SFCmin(i)) / 2^bits(d,i)  * 2^bits(d,i)/(SFCmax(d) - SFCmin(d))
          = TCmax(i)

At this point, just double or cube TCmax(i) (per two or three dimensions, respectively) for index. The smallest value corresponds to the optimal index.

An aside: Consider that, in computation of the Hilbert curve, bits across each dimension are interleaved. Thus, the least constraining dimension occupies the most significant bit. A set of interleaved bits, one per dimension can be viewed as a bisecting hyperplane. The goal can then be restated as minimizing the number of bisecting hyperplanes.

Statistics Approach

Determine the most constraining dimension using a histogram. First determine the number of items that fall within the adjusted query range (from above AQR(d,i)) per dimension. Choose the index i which contains the smallest most value.

Brute Force Approach with Statistics

Compute the query ranges for each index. Then intersect with the histograms over row ranges for index. Choose the set of ranges for the smallest range set.

Clone this wiki locally