-
Notifications
You must be signed in to change notification settings - Fork 2
building_graphs
This tutorial shows different methods of building graphs.
The easiest way to build a search graph is to use the function deglib.builder.build_from_data().
import numpy as np
import deglib
data = np.random.random((10_000, 256)).astype(np.float32)
graph = deglib.builder.build_from_data(data)This will instantiate a SizeBoundedGraph containing 10_000 random feature vectors from data.
There are many different parameters to control the building process. See build_from_data() for more information.
The crEG paper introduces an additional parameter, LIDType, to determine whether a dataset exhibits high complexity and Local Intrinsic Dimensionality (LID) or if it is relatively low. In contrast, the DEG paper presents a new algorithm that does not rely on this information. Consequently, DEG defaults to LID.Unknown. However, if the LID is known, utilizing it can be beneficial, as multi-threaded graph construction is only possible with these parameters.
Deglib implements two different metrics:
- The L2 distance or euclidean distance.
- The InnerProduct distance.
The L2 metric is also available for uint8 features as L2_Uint8.
There are two kinds of indices used in a graph: internal_index and external_label. Both are integers and specify a vertex in a graph.
Internal Indices are dense, which means that every internal_index < len(graph) can be used. For example: If you add 100 vertices and remove the vertex with internal_index 42, the last vertex in the graph will be moved to index 42.
In contrast, external label is a user defined identifier for each added vertex (see builder.add_entry(external_label, feature_vector)). Adding or Removing vertices to the graph will keep the connection between external labels and associated feature vector.
When you create the external labels by starting with 0 and increasing it for each entry by 1 and don’t remove elements from the graph, external labels and internal indices are equal.
# as long as no elements are removed
# external labels and internal indices are equal
for i, vec in enumerate(data):
builder.add_entry(i, vec)The above method is easy to use, but has drawbacks. For example it does not allow to load the data step by step, which leads to more memory usage.
If we iterate over batches of data, we can add them to the graph like this:
graph = deglib.graph.SizeBoundedGraph.create_empty(N_SAMPLES, DIM, 16, deglib.Metric.L2)
builder = deglib.builder.EvenRegularGraphBuilder(graph)
counter = 0
for batch in enumerate(batches):
labels = np.arange(counter, batch.shape[0])
builder.add_entry(labels, batch)
counter += batch.shape[0]
builder.build()
graph.remove_non_mrng_edges() # optional. This can help improve performance
# now you can use the graphIf you want to get more feedback on the progress you can supply a callback function to builder.build(callback). The callback takes a BuilderStatus as only argument.
A BuilderStatus has the following attributes:
- step (int) - number of graph manipulation steps
- added (int) - number of added vertices
- deleted (int) - number of deleted vertices
- improved (int) - number of successful improvements
- tries (int) - number of improvement tries