---
title: Vector Similarity Search in DQL
---

Dgraph v24 introduces a **vector data type** and **similarity search** to the [DQL query language](../dql/).

This guide shows how to use vector embeddings and similarity search in Dgraph. This example uses [Ratel](../../ratel/) for schema updates, mutations, and queries, but you can use any DQL client.
## Define Schema

Define a DQL schema with a vector predicate. You can set this via the Ratel schema tab using the bulk edit option, or use any DQL client:

```dql
<Issue.description>: string .

<Issue.vector_embedding>: float32vector @index(hnsw(metric:"euclidean")) .

type <Issue> {
  Issue.description
  Issue.vector_embedding
}
```

The `float32vector` type is used with the `hnsw` index type. The `hnsw` index supports different distance metrics: `cosine`, `euclidean`, or `dotproduct`. This example uses `euclidean` distance.

## Insert Data

Insert data containing vector embeddings using a DQL mutation. You can paste this into Ratel as a mutation, or use curl, pydgraph, or any DQL client:

```json
{
  "set": [
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.25, 0.47, 0.8, 0.27]",
      "Issue.description": "Intermittent timeouts. Logs show no such host error."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.57, 0.23, 0.68, 0.41]",
      "Issue.description": "Bug when user adds record with blank surName. Field is required so should be checked in web page."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.26, 0.12, 0.77, 0.57]",
      "Issue.description": "Delays on responses every 30 minutes with high network latency in backplane"
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.45, 0.49, 0.72, 0.2]",
      "Issue.description": "Slow queries intermittently. The host is not found according to logs."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.52, 0.05, 0.22, 0.82]",
      "Issue.description": "Some timeouts. It seems to be a DNS host lookup issue. Seeing No Such Host message."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.33, 0.64, 0.16, 0.68]",
      "Issue.description": "Host and DNS issues are causing timeouts in the User Details web page"
    }
  ]
}
```

:::note
For simplicity, this example uses small 4-dimensional vectors. In production, you would typically use vectors generated by ML models (e.g., embeddings from language models), which usually have 384, 512, 768, or more dimensions. The embeddings in this example represent four concepts in the four vector dimensions: slowness/delays, logging/messages, networks, and GUIs/web pages.
:::

## Basic Similarity Query

Use the `similar_to()` function to find similar items. For example, to find issues similar to a new issue description "Slow response and delay in my network!", represent it as the vector `[0.28, 0.75, 0.35, 0.48]`.

The `similar_to()` function takes three parameters:

1. The DQL field name (predicate)
2. The number of results to return
3. The vector to search for

```dql
query slownessWithLogs() {
  simVec(func: similar_to(Issue.vector_embedding, 3, "[0.28, 0.75, 0.35, 0.48]")) {
    uid
    Issue.description
  }
}
```
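
Because the schema declares the `euclidean` metric, this query returns the 3 embeddings nearest to the search vector by L2 distance. A quick way to sanity-check results is to recompute the distances client-side. The sketch below does this in plain Python over the six example vectors from the mutation above (the short labels are just illustrative tags for each description):

```python
import math

query_vec = [0.28, 0.75, 0.35, 0.48]

# The six Issue.vector_embedding values from the mutation above,
# keyed by an illustrative label for each description.
issues = {
    "timeouts/no such host": [0.25, 0.47, 0.8, 0.27],
    "blank surName bug":     [0.57, 0.23, 0.68, 0.41],
    "backplane latency":     [0.26, 0.12, 0.77, 0.57],
    "slow queries/host":     [0.45, 0.49, 0.72, 0.2],
    "DNS lookup timeouts":   [0.52, 0.05, 0.22, 0.82],
    "host/DNS web page":     [0.33, 0.64, 0.16, 0.68],
}

def euclidean(a, b):
    # L2 distance between two vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank all issues by distance and keep the 3 nearest,
# mirroring similar_to(Issue.vector_embedding, 3, ...).
top3 = sorted(issues, key=lambda k: euclidean(query_vec, issues[k]))[:3]
print(top3)
# → ['host/DNS web page', 'slow queries/host', 'timeouts/no such host']
```

The three nearest issues are the ones about slowness and host/network problems, which matches the intent of the query text.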

### Using Query Variables

You can use query variables to pass the vector dynamically:

```dql
query test($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
  }
}
```

When making the request, set the variable `vec` to a JSON float array:

```json
{
  "vec": [0.28, 0.75, 0.35, 0.48]
}
```
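
When driving the query from code, the float array is typically serialized to its string form before being placed in the variables map. A minimal Python sketch of assembling such a request body (the `$`-prefixed variable key follows DQL variable naming; the exact request shape depends on your client, so treat this as an assumption to adapt):

```python
import json

vec = [0.28, 0.75, 0.35, 0.48]

# Serialize the float array into its string form,
# e.g. "[0.28, 0.75, 0.35, 0.48]".
vec_str = json.dumps(vec)

# Illustrative request body pairing the DQL query with its variables
# (variable names keep their "$" prefix in the variables map).
body = {
    "query": """query test($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
  }
}""",
    "variables": {"$vec": vec_str},
}

print(sorted(body))  # → ['query', 'variables']
```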

## Computing Vector Distances and Similarity Scores

The `similar_to()` function uses the `hnsw` index with the distance metric declared in the schema (in this case, `euclidean` distance).

In some cases, you may want to compute the distance or similarity score explicitly. Keep in mind:

- **Distance**: lower values indicate more similarity
- **Similarity score**: higher values indicate more similarity

Dgraph v24 introduces the `dot` function to compute the dot product of vectors, which you can use to compute various similarity metrics.

### Distance Metrics

Given two vectors $$A=[a_1,a_2,...,a_n]$$ and $$B=[b_1,b_2,...,b_n]$$:

**Euclidean distance** is the L2 norm of A - B:

$$
D = \sqrt{(a_1 - b_1)^2+...+(a_n - b_n)^2}
$$

which can be expressed as:

$$D = \sqrt{(A-B) \cdot (A-B)}$$

**Cosine similarity** measures the angle between two vectors:

$$cosine(A,B) = \frac{A \cdot B}{||A|| \cdot ||B||}$$

Cosine similarity ranges from -1 to 1 (where 1 means the vectors point in the same direction). It's often converted to **cosine distance**:

$$cosine\_distance(A,B) = 1 - cosine(A,B)$$

When vectors are normalized ($$||A|| = 1$$ and $$||B|| = 1$$), which is usually the case with vector embeddings from ML models, the cosine computation simplifies to a single dot product:

$$dotproduct\_distance = 1 - A \cdot B$$

A common use case is to compute a **similarity score** or confidence. For normalized vectors:

$$similarity = \frac{1 + A \cdot B}{2}$$

This metric ranges from 0 to 1, with 1 being as similar as possible, making it useful for applying thresholds.
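
These formulas are easy to verify numerically. The sketch below implements each metric in plain Python and checks that, for unit-length vectors, the dot-product distance equals the full cosine distance (the two example vectors are the query vector and one issue embedding from earlier, normalized for the check):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    d = [x - y for x, y in zip(a, b)]
    return math.sqrt(dot(d, d))          # sqrt((A-B) . (A-B))

def cosine_distance(a, b):
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def similarity_score(a, b):
    return (1 + dot(a, b)) / 2           # in [0, 1] for unit vectors

def normalize(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

A = normalize([0.28, 0.75, 0.35, 0.48])
B = normalize([0.33, 0.64, 0.16, 0.68])

# For normalized vectors, 1 - A . B (dotproduct distance)
# equals the cosine distance computed with norms.
assert abs((1 - dot(A, B)) - cosine_distance(A, B)) < 1e-9

print(round(similarity_score(A, B), 3))  # → 0.977
```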

### Computing Distances in DQL

Here's an example query that computes euclidean, cosine, and dot product distances:

```dql
query slownessWithLogs($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
    vemb as Issue.vector_embedding

    euclidean_distance: Math(sqrt(($vec - vemb) dot ($vec - vemb)))

    dotproduct_distance: Math(1.0 - (($vec) dot vemb))

    cosine as Math((($vec) dot vemb) / sqrt((($vec) dot ($vec)) * (vemb dot vemb)))
    cosine_distance: Math(1.0 - cosine)

    similarity_score: Math((1.0 + (($vec) dot vemb)) / 2.0)
  }
}
```

You typically compute the same distance as defined in the index, or use the similarity score.

### Ordering Results by Similarity Score

The following query computes the similarity score in a variable and uses it to order the 3 closest nodes by similarity:

```dql
query slownessWithLogs($vec: float32vector) {
  var(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    vemb as Issue.vector_embedding
    score as Math((1.0 + (($vec) dot vemb)) / 2.0)
  }
  # score is now a map of uid -> similarity_score

  simVec(func: uid(score), orderdesc: val(score)) {
    uid
    Issue.description
    score: val(score)
  }
}
```
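
The ranking step can be reproduced client-side as a sanity check. The sketch below assumes `similar_to` has already selected the 3 candidates nearest to the query vector (the three embeddings below, taken from the mutation data; the truncated labels are illustrative), then computes and sorts the scores exactly as the query does:

```python
query_vec = [0.28, 0.75, 0.35, 0.48]

# Embeddings of the three candidate issues, keyed by a truncated
# form of each Issue.description (labels are illustrative only).
candidates = {
    "Host and DNS issues ...": [0.33, 0.64, 0.16, 0.68],
    "Slow queries ...":        [0.45, 0.49, 0.72, 0.2],
    "Intermittent timeouts ...": [0.25, 0.47, 0.8, 0.27],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# similarity_score = (1 + vec . vemb) / 2, as in the query above
scores = {k: (1 + dot(query_vec, v)) / 2 for k, v in candidates.items()}

# equivalent of orderdesc: val(score)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # → Host and DNS issues ...
```

Here the euclidean nearest neighbor also gets the highest similarity score, but that is not guaranteed in general: the candidate set comes from the index metric, while the ordering comes from whatever expression you compute.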

## Summary

This guide demonstrates how to:

- Define a schema with vector predicates and `hnsw` indexes
- Insert data with vector embeddings
- Perform similarity searches using the `similar_to()` function
- Compute various distance metrics and similarity scores using the `dot` function

For production use cases, you would typically:

1. Generate vector embeddings from your text/data using ML models (e.g., sentence transformers, OpenAI embeddings)
2. Store these embeddings in Dgraph
3. Use `similar_to()` to find semantically similar items
4. Optionally compute similarity scores to filter or rank results

## Related Topics

- [DQL Schema](../dql/dql-schema) - Learn about DQL schema definitions
- [DQL Query](../dql/query/dql-query) - Learn about DQL queries
- [Predicate Indexing](../dql/predicate-indexing) - Learn about indexing options