import { Cards } from 'nextra/components'
import GitHub from '/components/icons/GitHub'
import { Callout } from 'nextra/components'
The embeddings module provides tools for calculating sentence embeddings on node strings using PyTorch.
## Procedures
### `node_sentence()`
The procedure computes sentence embeddings on the string properties of nodes. The embeddings are
stored as a property on each node in the graph.
{<h4 className="custom-header"> Input: </h4>}
- `input_nodes: List[Vertex]` (**OPTIONAL**) ➡ The list of nodes to compute the embeddings for. If not provided, the embeddings are computed for all nodes in the graph.
- `configuration: mgp.Map` (**OPTIONAL**) ➡ User-defined parameters from the query module. Defaults to `{}`. The following keys are supported:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `embedding_property` | string | `"embedding"` | The name of the node property to store the embeddings in. |
| `excluded_properties` | List[string] | `[]` | The list of properties to exclude from the embeddings computation. |
| `model_name` | string | `"all-MiniLM-L6-v2"` | The name of the model to use for the embeddings computation, provided by the `sentence-transformers` library. |
| `return_embeddings` | bool | `False` | Whether to return the embeddings as an additional output. |
| `batch_size` | int | `2000` | The batch size to use for the embeddings computation. |
| `chunk_size` | int | `48` | The number of batches per "chunk". Used when computing embeddings across multiple GPUs, which is done by spawning multiple processes; each spawned process computes the embeddings for a single chunk. |
| `device` | NULL\|string\|int\|List[string\|int] | `NULL` | The device to use for the embeddings computation (see below). |
<Callout type="info">

The `device` parameter can be one of the following:

- `NULL` (default) - Use the first GPU if available, otherwise use the CPU.
- `"cpu"` - Use the CPU for computation.
- `"cuda"` or `"all"` - Use all available CUDA devices for computation.
- `"cuda:id"` - Use a specific CUDA device for computation.
- `id` - Use a specific device for computation.
- `[id1, id2, ...]` - Use a list of device ids for computation.
- `["cuda:id1", "cuda:id2", ...]` - Use a list of CUDA devices for computation.

**Note**: If you're running on a GPU device, make sure to start your container
with GPU support enabled.

</Callout>
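As a sketch of how the pieces above fit together, the following query computes embeddings on the CPU for a subset of nodes. It assumes the `node_sentence(input_nodes, configuration)` calling convention described above; the label `Node` and the configuration values are illustrative:

```cypher
MATCH (n:Node)
WITH collect(n) AS nodes
CALL embeddings.node_sentence(nodes, {device: "cpu", batch_size: 1000})
YIELD success
RETURN success;
```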
### `text()`

The procedure computes sentence embeddings for a list of input strings. The configuration map supports the following keys:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `model_name` | string | `"all-MiniLM-L6-v2"` | The name of the model to use for the embeddings computation, provided by the `sentence-transformers` library. |
| `batch_size` | int | `2000` | The batch size to use for the embeddings computation. |
| `chunk_size` | int | `48` | The number of batches per "chunk". Used when computing embeddings across multiple GPUs, which is done by spawning multiple processes; each spawned process computes the embeddings for a single chunk. |
| `device` | NULL\|string\|int\|List[string\|int] | `NULL` | The device to use for the embeddings computation (see the note above). |
{<h4 className="custom-header"> Output: </h4>}
- `success: bool` ➡ Whether the embeddings computation was successful.
- `embeddings: List[List[float]]` ➡ The list of computed embeddings.
- `dimension: int` ➡ The dimension of the embeddings.
{<h4 className="custom-header"> Usage: </h4>}
To compute the embeddings for a list of strings, use the following query:

```cypher
CALL embeddings.text(["Hello", "World"])
YIELD success, embeddings;
```
### `model_info()`
The procedure returns information about the model used for the embeddings computation.
{<h4 className="custom-header"> Input: </h4>}
- `configuration: mgp.Map` (**OPTIONAL**) ➡ User-defined parameters from the query module. Defaults to `{}`. The key `model_name` specifies the name of the model to use for the embeddings computation.
{<h4 className="custom-header"> Output: </h4>}
- `model_info: mgp.Map` ➡ Information about the model used for the embeddings computation, with the following keys:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `model_name` | string | `"all-MiniLM-L6-v2"` | The name of the model used for the embeddings computation, provided by the `sentence-transformers` library. |
| `dimension` | int | `384` | The dimension of the embeddings. |
| `max_seq_length` | int | `256` | The maximum sequence length. |
## Example
Run the following query to compute the embeddings: