DBScan

For a summary of DBScan, see https://en.wikipedia.org/wiki/DBSCAN.

The DBScan implementation in GeoWave is incrementally evolving. As of 0.9.0, DBScan is implemented using Map-Reduce jobs. Ideally, the algorithm will be moved into a Spark implementation.

The approach chosen meets the following requirements.

Geo-spatial (orthdromic) distance is part of the distance measurement. Two objects cannot be close if they are not Geo-spatial close.
Distance measures can be furthered refined to include other attributes. Two objects can be Geo-spatial close and be far along other dimensions.
The solution is scalable, The implementation is compatible with Hadoop 2.0 infrastructure.
Support Points, Polygons, Lines and Line String
Ideally, the solution is not restricted to Simple Features.
Leverage GeoWave's components, where possible. This is intended to be a GeoWave solution.

The last requirement is intended to take advantage of GeoWave's components. There are a number of general purpose DBScan algorithms available. Some perform quite well. However, assuming geo-spatial distances, GeoWave can offer some short-cuts that present performance and scale advantages.

Just prior to developing DBScan, support for subclass of Near-Neighbors class of algorithms was introduced. with the intent of reuse, the support infrastructure lay the ground work for DBScan. The specific near-neighbors problem set is summarized as follows.

For each item in the data set, find all other items in the same data set that are considered neighbors by a provided distance function.

The distance function assume geo-spatial proximity as one of the primary measures, meeting requirements 1 and 2. Incrementally applied, the infrastructure could be used to build a nearest neighbor graph, clusters, etc. An example use case is finding those items whose cardinality of the set of neighbors exceed some minimum bound.

The solution for near neighbor search involves partitioning the data such that neighboring items are in the same partition. The GeoWave indexing strategy creates geo-hashes as a partition identifier (Mapper key). The width of a grid cell, manually set by a parameter, determines the dimensions of the grid (thus, the upper bound on the number of partitions). The width of cell must exceed twice the maximum distance threshold (\delta) for neighbors. Choosing a fine-grain grid reduces the size of the partitions and increase the number of partitions. Using the index strategy allows support of any mix of dimensions including time.

In each partition, the items are compared to each using a distance function (i.e. Reducer). A pair of items are neighbors if they have a distance less than the maximum distance threshold \delta. For $$M_p$$ items in a partition p, the run time has an upper bound $$M_p$$^2/2, thus, O($$M_p$$).

The chosen partition for each item is called the primary partition of that item. Geometries other than points may be placed in more than one primary partition. Items that are rarely in the exact middle of a partition. Thus, to form the full set of neighbors, items are also submitted to neighbor partitions (within \delta distance). Items included in a neighbor partition are labeled as secondary. The distinction of primary and secondary is used in the search step within the partition. Secondary items are passive participants in the search. The basic algorithm is as follows:

 primaries <- items in primary partition
 neighorSet <- {}
 for each item $$P_i$$ primary in primaries
     remove item from primaries
     for each item $$P_j** in primaries
         dist = distance($$I_i, $$I_j)
         if (dist < \delta) then add tuple ($$I_i, $$I_j, dist) to neighorSet
     for each item $$O_j$$ in secondaries
         dist = distance($$I_i, $$P_j)
         if (dist < \delta) then add tuple ($$I_i, $$P_j, dist) to neighorSet
 return neighborSet

With the intent of reuse this ground work

At this time, The algorithm focused The basic approach of the algorithm is

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DBScan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally