Commit aa62be3

mattkjames7 and matea16 authored
Embedding improvements (#1450)
* add link to cuda container toolkit
* update embeddings page
* updated function names
* update callout
* updated embeddings page
* Update pages/advanced-algorithms/available-algorithms/embeddings.mdx

---------

Co-authored-by: matea16 <mateapesic@hotmail.com>
Co-authored-by: Matea Pesic <80577904+matea16@users.noreply.github.com>
1 parent 51faf0a commit aa62be3

File tree

2 files changed: +135 −22 lines changed


pages/advanced-algorithms/available-algorithms/embeddings.mdx

Lines changed: 132 additions & 22 deletions
````diff
@@ -7,6 +7,7 @@ description: Calculate sentence embeddings on node strings using pytorch.
 
 import { Cards } from 'nextra/components'
 import GitHub from '/components/icons/GitHub'
+import { Callout } from 'nextra/components'
 
 The embeddings module provides tools for calculating sentence embeddings on node strings using pytorch.
 
````
````diff
@@ -27,38 +28,59 @@ The embeddings module provides tools for calculating sentence embeddings on node
 
 ## Procedures
 
-### `compute()`
+### `node_sentence()`
 
 The procedure computes the sentence embeddings on the string properties of nodes. Embeddings are
 created as a property of the nodes in the graph.
 
 {<h4 className="custom-header"> Input: </h4>}
 
 - `input_nodes: List[Vertex]` (**OPTIONAL**) ➡ The list of nodes to compute the embeddings for. If not provided, the embeddings are computed for all nodes in the graph.
-- `embedding_property: string` ➡ The name of the property to store the embeddings in. This property is `embedding` by default.
-- `excluded_properties: List[string]` ➡ The list of properties to exclude from the embeddings computation. This list is empty by default.
-- `model_name: string` ➡ The name of the model to use for the embeddings computation, buy default this module uses the `all-MiniLM-L6-v2` model provided by the `sentence-transformers` library.
-- `batch_size: int` ➡ The batch size to use for the embeddings computation. This is set to `2000` by default.
-- `chunk_size: int` ➡ The number of batches per "chunk". This is used when computing embeddings across multiple GPUs, as this has to be done by spawning multiple processes. Each spawned process computes the embeddings for a single chunk. This is set to 48 by default.
-- `device: string|int|List[string|int]` ➡ The device to use for the embeddings computation. This can be any of the following:
+- `configuration`: (`mgp.Map`, **OPTIONAL**): User defined parameters from query module. Defaults to `{}`.
+
+**Configuration options:**
+
+| Name | Type | Default | Description |
+|------|------|---------|-------------|
+| `embedding_property` | string | `"embedding"` | The name of the property to store the embeddings in. |
+| `excluded_properties` | List[string] | `[]` | The list of properties to exclude from the embeddings computation. |
+| `model_name` | string | `"all-MiniLM-L6-v2"` | The name of the model to use for the embeddings computation, provided by the `sentence-transformers` library. |
+| `return_embeddings` | bool | `False` | Whether to return the embeddings as an additional output or not. |
+| `batch_size` | int | `2000` | The batch size to use for the embeddings computation. |
+| `chunk_size` | int | `48` | The number of batches per "chunk". This is used when computing embeddings across multiple GPUs, as this has to be done by spawning multiple processes. Each spawned process computes the embeddings for a single chunk. |
+| `device` | NULL\|string\|int\|List[string\|int] | `NULL` | The device to use for the embeddings computation (see below). |
+
+<Callout type="info">
+The `device` parameter can be one of the following:
+- `NULL` (default) - Use first GPU if available, otherwise use CPU.
 - `"cpu"` - Use CPU for computation.
 - `"cuda"` or `"all"` - Use all available CUDA devices for computation.
 - `"cuda:id"` - Use a specific CUDA device for computation.
 - `id` - Use a specific device for computation.
 - `[id1, id2, ...]` - Use a list of device ids for computation.
 - `["cuda:id1", "cuda:id2", ...]` - Use a list of CUDA devices for computation.
-by default, the first device (`0`) is used.
+
+**Note**: If you're running on a GPU device, make sure to start your container
+with the `--gpus=all` flag.
+For more details, see the [Install MAGE
+documentation](/advanced-algorithms/install-mage).
+</Callout>
+
 
 {<h4 className="custom-header"> Output: </h4>}
 
 - `success: bool` ➡ Whether the embeddings computation was successful.
+- `embeddings: List[List[float]]|NULL` ➡ The list of embeddings. Only returned if the
+  `return_embeddings` parameter is set to `true` in the configuration, otherwise `NULL`.
+- `dimension: int` ➡ The dimension of the embeddings.
 
 {<h4 className="custom-header"> Usage: </h4>}
 
-To compute the embeddings across the entire graph with the default parameters, use the following query:
+To compute the embeddings across the entire graph with the default parameters,
+use the following query:
 
 ```cypher
-CALL embeddings.compute()
+CALL embeddings.node_sentence()
 YIELD success;
 ```
 
````
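The options in the configuration table above can be combined in a single map in place of the removed positional arguments. A minimal sketch, assuming a hypothetical `internal_id` property that should stay out of the sentence text and an illustrative custom property name for the result:

```cypher
// Hedged sketch: "node_vec" and "internal_id" are illustrative names only;
// the option keys come from the configuration table above.
WITH {embedding_property: "node_vec", excluded_properties: ["internal_id"]} AS configuration
CALL embeddings.node_sentence(NULL, configuration)
YIELD success;
```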

````diff
@@ -70,25 +92,79 @@ MATCH (n)
 WITH n ORDER BY id(n)
 LIMIT 5
 WITH collect(n) AS subset
-CALL embeddings.compute(subset)
+CALL embeddings.node_sentence(subset)
 YIELD success;
 ```
 
 To run the computation on specific device(s), use the following query:
 
 ```cypher
-CALL embeddings.compute(
-    NULL,
-    "embedding",
-    NULL,
-    "all-MiniLM-L6-v2",
-    2000,
-    48,
-    "cuda:1"
-)
+WITH {device: "cuda:1"} AS configuration
+CALL embeddings.node_sentence(NULL, configuration)
 YIELD success;
 ```
 
+To return the embeddings as an additional output, use the following query:
+
+```cypher
+WITH {return_embeddings: True} AS configuration
+CALL embeddings.node_sentence(NULL, configuration)
+YIELD success, embeddings;
+```
+
+
+### `text()`
+
+This procedure can be used to return a list of embeddings when given a list of strings.
+
+{<h4 className="custom-header"> Input: </h4>}
+
+- `strings: List[string]` ➡ The list of strings to compute the embeddings for.
+- `configuration: mgp.Map` (**OPTIONAL**) ➡ User defined parameters from query module. Defaults to `{}`.
+
+**Configuration options:**
+
+| Name | Type | Default | Description |
+|------|------|---------|-------------|
+| `model_name` | string | `"all-MiniLM-L6-v2"` | The name of the model to use for the embeddings computation, provided by the `sentence-transformers` library. |
+| `batch_size` | int | `2000` | The batch size to use for the embeddings computation. |
+| `chunk_size` | int | `48` | The number of batches per "chunk". This is used when computing embeddings across multiple GPUs, as this has to be done by spawning multiple processes. Each spawned process computes the embeddings for a single chunk. |
+| `device` | NULL\|string\|int\|List[string\|int] | `NULL` | The device to use for the embeddings computation. |
+
+
+{<h4 className="custom-header"> Output: </h4>}
+
+- `success: bool` ➡ Whether the embeddings computation was successful.
+- `embeddings: List[List[float]]` ➡ The list of embeddings.
+- `dimension: int` ➡ The dimension of the embeddings.
+
+{<h4 className="custom-header"> Usage: </h4>}
+
+To compute the embeddings for a list of strings, use the following query:
+
+```cypher
+CALL embeddings.text(["Hello", "World"])
+YIELD success, embeddings;
+```
+
+### `model_info()`
+
+The procedure returns the information about the model used for the embeddings computation.
+
+{<h4 className="custom-header"> Input: </h4>}
+
+- `configuration: mgp.Map` (**OPTIONAL**) ➡ User defined parameters from query module. Defaults to `{}`.
+  The key `model_name` is used to specify the name of the model to use for the embeddings computation.
+
+{<h4 className="custom-header"> Output: </h4>}
+
+- `model_info: mgp.Map` ➡ The information about the model used for the embeddings computation.
+
+| Name | Type | Default | Description |
+|------|------|---------|-------------|
+| `model_name` | string | `"all-MiniLM-L6-v2"` | The name of the model to use for the embeddings computation, provided by the `sentence-transformers` library. |
+| `dimension` | int | `384` | The dimension of the embeddings. |
+| `max_seq_length` | int | `256` | The maximum sequence length. |
 
 ## Example
 
````
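Since `device` also accepts a list of ids, a multi-GPU run can presumably be requested through the same configuration map. A sketch assuming two CUDA devices are visible to the container:

```cypher
// Assumes at least two CUDA devices; the ids are illustrative.
WITH {device: ["cuda:0", "cuda:1"]} AS configuration
CALL embeddings.node_sentence(NULL, configuration)
YIELD success;
```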
````diff
@@ -106,7 +182,7 @@ CREATE (a:Node {id: 1, Title: "Stilton", Description: "A stinky cheese from the
 Run the following query to compute the embeddings:
 
 ```cypher
-CALL embeddings.compute()
+CALL embeddings.node_sentence()
 YIELD success;
 
 MATCH (n)
````
````diff
@@ -132,4 +208,38 @@
 | "Parmesan" | [-0.0755439, 0.00906182, -0.010977, 0.0208911, -0.0527448, 0.0085... |
 | "Red Leicester" | [-0.0244318, -0.0280038, -0.0373183, 0.0284436, -0.0277753, 0.066... |
 +----------------------------------------------------------------------+----------------------------------------------------------------------+
-```
+```
+
+To compute the embeddings for a list of strings, use the following query:
+
+```cypher
+CALL embeddings.text(["Hello", "World"])
+YIELD success, embeddings;
+```
+
+Results:
+
+```plaintext
++----------------------------------------------------------+----------------------------------------------------------------------------------+
+| success | embeddings |
++----------------------------------------------------------+----------------------------------------------------------------------------------+
+| true | [[-0.0627718, 0.0549588, 0.0521648, 0.08579, -0.0827489, -0.074573, 0.0685547... |
++----------------------------------------------------------+----------------------------------------------------------------------------------+
+```
+
+To get the information about the model used for the embeddings computation, use the following query:
+
+```cypher
+CALL embeddings.model_info()
+YIELD info;
+```
+
+Results:
+
+```plaintext
++----------------------------------------------------------------------------+
+| info |
++----------------------------------------------------------------------------+
+| {dimension: 384, max_sequence_length: 256, model_name: "all-MiniLM-L6-v2"} |
++----------------------------------------------------------------------------+
+```
````
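As `text()` accepts the same optional configuration map as `node_sentence()`, the model and batch size should be adjustable per call as well. A sketch that simply restates the documented defaults explicitly:

```cypher
// The values shown are the documented defaults, passed explicitly for illustration.
WITH {model_name: "all-MiniLM-L6-v2", batch_size: 2000} AS configuration
CALL embeddings.text(["Stilton", "Parmesan"], configuration)
YIELD success, embeddings, dimension;
```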

pages/advanced-algorithms/install-mage.mdx

Lines changed: 3 additions & 0 deletions
````diff
@@ -36,6 +36,9 @@ The following tags are available on Docker Hub:
 - `x.y-relwithdebinfo-cuda` - Memgraph built with CUDA support* - available since version `3.6.1`.
 
 *To run GPU-accelerated algorithms, you need to launch the container with the `--gpus all` flag.
+This requires the installation of NVIDIA Container Toolkit. See the
+[NVIDIA Container Toolkit documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+for more details.
 
 For versions prior to `3.2`, MAGE image tags included both MAGE and Memgraph versions, e.g.
 
````