Commit 8b1c717

add vector computation howto
Add blog content on vector distance in DQL as doc how to.
1 parent 0f6627d commit 8b1c717

4 files changed

+432
-0
lines changed

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@
---
title: Vector Similarity Search in DQL
---

Dgraph v24 introduces a **vector data type** and **similarity search** to the [DQL query language](../dql/).

This guide shows how to use vector embeddings and similarity search in Dgraph. The examples use [Ratel](../../ratel/) for schema updates, mutations, and queries, but any DQL client works.

## Define Schema

Define a DQL schema with a vector predicate. You can apply it from the Ratel schema tab using the bulk edit option, or with any DQL client:

```dql
<Issue.description>: string .

<Issue.vector_embedding>: float32vector @index(hnsw(metric:"euclidean")) .

type <Issue> {
  Issue.description
  Issue.vector_embedding
}
```

The `float32vector` type is indexed with the `hnsw` index type, which supports the `cosine`, `euclidean`, and `dotproduct` distance metrics. This example uses `euclidean` distance.

## Insert Data

Insert data containing vector embeddings using a DQL mutation. You can paste this into Ratel as a mutation, or send it with curl, pydgraph, or any DQL client:

```json
{
  "set": [
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.25, 0.47, 0.8, 0.27]",
      "Issue.description": "Intermittent timeouts. Logs show no such host error."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.57, 0.23, 0.68, 0.41]",
      "Issue.description": "Bug when user adds record with blank surName. Field is required so should be checked in web page."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.26, 0.12, 0.77, 0.57]",
      "Issue.description": "Delays on responses every 30 minutes with high network latency in backplane"
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.45, 0.49, 0.72, 0.2]",
      "Issue.description": "Slow queries intermittently. The host is not found according to logs."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.52, 0.05, 0.22, 0.82]",
      "Issue.description": "Some timeouts. It seems to be a DNS host lookup issue. Seeing No Such Host message."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.33, 0.64, 0.16, 0.68]",
      "Issue.description": "Host and DNS issues are causing timeouts in the User Details web page"
    }
  ]
}
```
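If the embeddings come from code rather than being typed by hand, the mutation body can be assembled programmatically. The following is a minimal Python sketch using only the standard library. The helper names `vector_literal` and `build_set_mutation` are hypothetical, and the sketch assumes the vector is sent as a bracketed string, as in the mutation above. It only builds the JSON text; sending it to Dgraph is left to whichever client you use.

```python
import json


def vector_literal(values):
    """Serialize floats as the bracketed string form used in the mutation above."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


def build_set_mutation(issues):
    """Build a {"set": [...]} mutation from (description, embedding) pairs."""
    return {
        "set": [
            {
                "dgraph.type": "Issue",
                "Issue.vector_embedding": vector_literal(embedding),
                "Issue.description": description,
            }
            for description, embedding in issues
        ]
    }


# Two of the sample issues from this guide.
issues = [
    ("Intermittent timeouts. Logs show no such host error.",
     [0.25, 0.47, 0.8, 0.27]),
    ("Delays on responses every 30 minutes with high network latency in backplane",
     [0.26, 0.12, 0.77, 0.57]),
]

mutation_json = json.dumps(build_set_mutation(issues), indent=2)
print(mutation_json)
```

Keeping the serialization in one helper makes it easier to change the vector format in a single place if your client expects something different.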

:::note
For simplicity, this example uses small 4-dimensional vectors. In production, you would typically use vectors generated by ML models (e.g., embeddings from language models), which usually have 384, 512, 768, or more dimensions. The four dimensions in this example each represent one concept: slowness/delays, logging/messages, networks, and GUIs/web pages.
:::

## Basic Similarity Query

Use the `similar_to()` function to find similar items. For example, to find issues similar to a new issue described as "Slow response and delay in my network!", first represent that description as a vector, here `[0.28, 0.75, 0.35, 0.48]`, then pass the vector to `similar_to()`.

The `similar_to()` function takes three parameters:
1. The DQL field name (predicate) that holds the vectors
2. The number of results to return
3. The vector to search for

```dql
query slownessWithLogs() {
  simVec(func: similar_to(Issue.vector_embedding, 3, "[0.28, 0.75, 0.35, 0.48]")) {
    uid
    Issue.description
  }
}
```
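Conceptually, the query asks for the 3 stored vectors nearest to the query vector under the index metric. An HNSW index is an approximate nearest-neighbor structure, so on large datasets its results are not guaranteed to match an exhaustive search. As a rough mental model only, here is a plain-Python sketch of the exhaustive version over the six sample embeddings. The labels `issue1` to `issue6` and the function names are hypothetical (Dgraph assigns the real uids), and this is not how Dgraph implements the search.

```python
import math

# Sample embeddings from the mutation above, in insertion order.
# The keys are hypothetical labels, not Dgraph uids.
EMBEDDINGS = {
    "issue1": [0.25, 0.47, 0.8, 0.27],
    "issue2": [0.57, 0.23, 0.68, 0.41],
    "issue3": [0.26, 0.12, 0.77, 0.57],
    "issue4": [0.45, 0.49, 0.72, 0.2],
    "issue5": [0.52, 0.05, 0.22, 0.82],
    "issue6": [0.33, 0.64, 0.16, 0.68],
}


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, embeddings, k):
    """Return the k labels whose vectors are closest to the query (exhaustive scan)."""
    ranked = sorted(embeddings, key=lambda name: euclidean(query, embeddings[name]))
    return ranked[:k]


top3 = nearest([0.28, 0.75, 0.35, 0.48], EMBEDDINGS, 3)
print(top3)
```

On this sample data the exhaustive scan ranks the "Host and DNS issues" entry closest, followed by the "Slow queries" and "Intermittent timeouts" entries.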

### Using Query Variables

You can use query variables to pass the vector dynamically:

```dql
query test($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
  }
}
```

When making the request, set the variable `vec` to a JSON float array:

```json
{
  "vec": [0.28, 0.75, 0.35, 0.48]
}
```

## Computing Vector Distances and Similarity Scores

The `similar_to()` function uses the `hnsw` index with the distance metric declared in the schema (in this case, `euclidean` distance).

In some cases, you may want to compute the distance or similarity score explicitly. Keep in mind:
- **Distance**: Lower values indicate more similarity
- **Similarity score**: Higher values indicate more similarity

Dgraph v24 introduces the `dot` function to compute the dot product of two vectors. Several similarity metrics can be built from it.

### Distance Metrics

Given two vectors $$A=[a_1,a_2,\dots,a_n]$$ and $$B=[b_1,b_2,\dots,b_n]$$:

**Euclidean distance** is the L2 norm of A - B:

$$
D = \sqrt{(a_1 - b_1)^2 + \dots + (a_n - b_n)^2}
$$

Which can be expressed with a dot product as:

$$D = \sqrt{(A-B) \cdot (A-B)}$$
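
As a short worked step (a derivation from the definitions above, not taken from the Dgraph documentation), expanding the dot product relates the Euclidean distance to the plain dot product of the two vectors:

$$
(A-B) \cdot (A-B) = A \cdot A - 2\,(A \cdot B) + B \cdot B
$$

For unit-length vectors, where $$A \cdot A = B \cdot B = 1$$, this reduces to $$D^2 = 2 - 2\,(A \cdot B)$$, so a larger dot product always corresponds to a smaller Euclidean distance.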

**Cosine similarity** measures the angle between two vectors:

$$cosine(A,B) = \frac{A \cdot B}{||A|| \cdot ||B||}$$

Cosine similarity ranges from -1 to 1, where 1 means the vectors point in the same direction and -1 means they point in opposite directions. It's often converted to **cosine distance**:

$$cosine\_distance(A,B) = 1 - cosine(A,B)$$

When vectors are normalized ($$||A|| = 1$$ and $$||B|| = 1$$), which is usually the case with vector embeddings from ML models, the denominator equals 1 and the cosine reduces to a dot product:

$$dotproduct\_distance = 1 - A \cdot B$$

A common use case is to compute a **similarity score** or confidence. For normalized vectors:

$$similarity = \frac{1 + A \cdot B}{2}$$

This metric ranges from 0 to 1, with 1 being as similar as possible, making it useful for applying thresholds.
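
To check these formulas against each other outside of Dgraph, here is a small plain-Python sketch. The function names are hypothetical, and the sketch normalizes the sample vectors first because the dot-product distance and the similarity score above are only meaningful for unit-length vectors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


def metrics(a, b):
    """Compute the distances and the score defined above for vectors a and b."""
    diff = [x - y for x, y in zip(a, b)]
    cosine = dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))
    return {
        "euclidean_distance": math.sqrt(dot(diff, diff)),
        "cosine_distance": 1.0 - cosine,
        # The next two entries assume a and b are already normalized.
        "dotproduct_distance": 1.0 - dot(a, b),
        "similarity_score": (1.0 + dot(a, b)) / 2.0,
    }


# Query vector and one sample embedding from this guide, normalized first.
query = normalize([0.28, 0.75, 0.35, 0.48])
issue = normalize([0.33, 0.64, 0.16, 0.68])

result = metrics(query, issue)
print(result)
```

With normalized inputs, `cosine_distance` and `dotproduct_distance` agree up to floating-point error, which is a quick sanity check that the embeddings you store really are unit length.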

### Computing Distances in DQL

Here's an example query that computes the euclidean, cosine, and dot product distances, plus the similarity score, for each result:

```dql
query slownessWithLogs($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
    vemb as Issue.vector_embedding

    # L2 norm of ($vec - vemb)
    euclidean_distance: Math(sqrt(($vec - vemb) dot ($vec - vemb)))

    # assumes normalized vectors
    dotproduct_distance: Math(1.0 - (($vec) dot vemb))

    cosine as Math((($vec) dot vemb) / sqrt((($vec) dot ($vec)) * (vemb dot vemb)))
    cosine_distance: Math(1.0 - cosine)

    # assumes normalized vectors
    similarity_score: Math((1.0 + (($vec) dot vemb)) / 2.0)
  }
}
```

In practice, you typically compute only the distance that matches the metric declared in the index, or the similarity score. The `dotproduct_distance` and `similarity_score` expressions assume normalized vectors.

### Ordering Results by Similarity Score

The following query computes the similarity score in a variable and uses it to order the 3 closest nodes by similarity:

```dql
query slownessWithLogs($vec: float32vector) {
  var(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    vemb as Issue.vector_embedding
    score as Math((1.0 + (($vec) dot vemb)) / 2.0)
  }
  # score is now a map of uid -> similarity_score

  simVec(func: uid(score), orderdesc: val(score)) {
    uid
    Issue.description
    score: val(score)
  }
}
```
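
Once the scores are in the result, a client can apply a threshold to discard weak matches. Below is a minimal Python sketch that filters a result shaped like the `simVec` block of the query above. The sample payload is hand-written for illustration, not captured server output: the uids are placeholders and the scores are rounded values of the similarity formula applied to the sample vectors. The helper name `filter_by_score` and the 0.95 threshold are assumptions.

```python
import json

# Hand-written sample shaped like the simVec block above; uids are placeholders.
response_text = """
{
  "simVec": [
    {"uid": "0x6", "Issue.description": "Host and DNS issues are causing timeouts in the User Details web page", "score": 0.977},
    {"uid": "0x4", "Issue.description": "Slow queries intermittently. The host is not found according to logs.", "score": 0.921},
    {"uid": "0x1", "Issue.description": "Intermittent timeouts. Logs show no such host error.", "score": 0.916}
  ]
}
"""


def filter_by_score(result, block, threshold):
    """Keep only the nodes in the named query block whose score meets the threshold."""
    return [
        node for node in result.get(block, [])
        if node.get("score", 0.0) >= threshold
    ]


matches = filter_by_score(json.loads(response_text), "simVec", 0.95)
for node in matches:
    print(node["uid"], node["score"], node["Issue.description"])
```

Because the score is bounded between 0 and 1 for normalized vectors, a single threshold can be reused across queries, although the right cutoff depends on the embedding model and should be tuned on real data.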

## Summary

This guide demonstrates how to:
- Define a schema with vector predicates and `hnsw` indexes
- Insert data with vector embeddings
- Perform similarity searches using the `similar_to()` function
- Compute various distance metrics and similarity scores using the `dot` function

For production use cases, you would typically:
1. Generate vector embeddings from your text/data using ML models (e.g., sentence transformers, OpenAI embeddings)
2. Store these embeddings in Dgraph
3. Use `similar_to()` to find semantically similar items
4. Optionally compute similarity scores to filter or rank results

## Related Topics

- [DQL Schema](../dql/dql-schema) - Learn about DQL schema definitions
- [DQL Query](../dql/query/dql-query) - Learn about DQL queries
- [Predicate Indexing](../dql/predicate-indexing) - Learn about indexing options
