Commit f543cda

feat: add setup guide for dense vector search (#246)
* feat: update open-api spec to include knn
* feat: add instructions for deploying embedding model and setting up dense vector search
* chore: remove extra image
* chore: move text embedding guide to a separate doc
* fix: address comments
* chore: address comments
* fix: add embedding index creation success check
* feat: use env variables for index and ingest pipeline names
* feat: add pointer to elasticsearch on cloudpak setup
1 parent c62267a commit f543cda

5 files changed: +271 -11 lines changed

[Binary file (image): 67.1 KB]
Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# How to set up and use 3rd-party text embeddings for dense vector search in Elasticsearch

This guide demonstrates how to deploy and use a text embedding model in Elasticsearch. The model generates vector representations of text, enabling vector similarity (k-nearest neighbours) search.

## Set up Elasticsearch

### Elasticsearch from IBM Cloud

If you are using Elasticsearch from IBM Cloud, please refer to [this guide](./ICD_Elasticsearch_install_and_setup.md) first to create an Elasticsearch instance and set up Kibana, if you haven't already.

### Elasticsearch on CloudPak

Alternatively, if you want to install Elasticsearch via Elastic Cloud on Kubernetes (ECK) in CloudPak, follow [this guide](./watsonx_discovery_install_and_setup.md) first to set up Elasticsearch and Kibana. You can skip [Enable ELSER model (v2)](./watsonx_discovery_install_and_setup.md#enable-elser-model-v2) and any section beyond it in that guide.
## Install the eland library

Run the command below to install the [eland](https://github.com/elastic/eland) library.

```bash
python -m pip install "eland[pytorch]"
```

This library lets you pull a 3rd-party text embedding model and deploy it to your Elasticsearch instance.

> CAUTION: Open source and 3rd-party models are not in the scope of IBM or Elastic indemnity clauses. Customers must accept the relevant terms and conditions to choose or bring their own models. Additionally, IBM has not assessed Elastic's supported multilingual models, so any use of Elastic-supported models should be understood thoroughly, both with respect to the terms of use for those models and the terms of use of all of the data that was used to train them.

NOTE: As of the time this documentation was written, `eland` only supports Python 3.8, 3.9, and 3.10. Please refer to the eland library [compatibility section](https://github.com/elastic/eland?tab=readme-ov-file#compatibility) to make sure you are using compatible Python and Elasticsearch versions.

NOTE: If you run into any issues installing the library, you can also use eland without installing it locally by using the Docker image provided [here](https://github.com/elastic/eland?tab=readme-ov-file#docker).
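If you take the Docker route, the invocation is typically along the lines of the sketch below. The image name and flags shown here are assumptions based on the eland README at the time of writing, so verify them against the linked Docker section, and substitute your own Elasticsearch URL and credentials:

```bash
# Sketch only: run eland_import_hub_model from the eland Docker image instead of a local install.
# Verify the image name and flags against the eland Docker documentation linked above.
docker run -it --rm --network host \
  docker.elastic.co/eland/eland \
  eland_import_hub_model \
    --url https://<hostname:port> \
    -u <username> -p <password> --insecure \
    --hub-model-id intfloat/multilingual-e5-small \
    --task-type text_embedding \
    --start
```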

## Create environment variables for ES credentials

Feel free to customize the names of the `ES_SOURCE_INDEX_NAME`, `ES_EMBEDDING_INDEX_NAME`, and `ES_PIPELINE_NAME` variables below. These names serve as references for your source index, embedding index, and ingest pipeline throughout this guide.

```bash
export ES_URL=https://<hostname:port>
export ES_USER=<username>
export ES_PASSWORD=<password>
export ES_CACERT=<path-to-your-cert>
export ES_SOURCE_INDEX_NAME=<name-of-source-index>
export ES_EMBEDDING_INDEX_NAME=<name-of-embedding-index>
export ES_PIPELINE_NAME=<name-of-ingest-pipeline>
```

You can find these credentials in the service credentials of your Elasticsearch instance.
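Optionally, verify the connection details before proceeding. The quick check below (assuming the variables above are exported) should return basic cluster information as JSON:

```bash
# Sanity check: prints cluster name and version if the credentials and certificate are valid.
curl -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" "${ES_URL}"
```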
## Pull and deploy an embedding model

Run the command below to pull your desired model from the [Hugging Face model hub](https://huggingface.co/models) and deploy it on your Elasticsearch instance:

```bash
eland_import_hub_model \
  --url $ES_URL \
  -u $ES_USER -p $ES_PASSWORD --insecure \
  --hub-model-id intfloat/multilingual-e5-small \
  --task-type text_embedding \
  --start
```

In this example, we use the `multilingual-e5-small` model, a multilingual model that supports text embeddings in 100 languages. You can read more about this model [here](https://huggingface.co/intfloat/multilingual-e5-small).
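Once the command finishes, you can confirm from the command line that the model was imported and its deployment started by querying the trained models stats API (a minimal check; the deployment state should report that it has started):

```bash
# Check the imported model and the state of its deployment.
curl -X GET "${ES_URL}/_ml/trained_models/intfloat__multilingual-e5-small/_stats" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}"
```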
## Synchronize your deployed model

Go to the **Machine Learning > Trained Models** page (http://localhost:5601/app/ml/trained_models) and synchronize your trained models. A warning message is displayed at the top of the page that says "ML job and trained model synchronization required". Follow the link to "Synchronize your jobs and trained models", then click **Synchronize**.

<img src="assets/synchronize_trained_model.png"/>

Once you synchronize your model, you should see your deployed model on the **Machine Learning > Model Management** page in Kibana.
## Test your deployed model

Run the command below to test the model using the `_infer` API:

```bash
curl -X POST "${ES_URL}/_ml/trained_models/intfloat__multilingual-e5-small/_infer" -u "${ES_USER}:${ES_PASSWORD}" -H "Content-Type: application/json" --cacert $ES_CACERT -d '{
  "docs": {
    "text_field": "how to set up custom extension?"
  }
}'
```

You should see a response containing the predicted embedding vector:

```json
{
  "inference_results": [
    {
      "predicted_value": [
        0.016921168193221092,
        -0.035475824028253555,
        -0.0497407428920269,
        ...
```

## Load sample data

Refer to the [Load data into Elasticsearch](./ICD_Elasticsearch_install_and_setup.md#load-data-into-elasticsearch) section in the Elasticsearch setup guide to upload sample data to Elasticsearch using Kibana.
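The ingest pipeline in the next section maps a `text` field in each source document to the model's `text_field` input, so the data in `ES_SOURCE_INDEX_NAME` is assumed to contain a `text` field. If you want to index a quick test document yourself, a hypothetical example looks like this (adjust the field name, or the pipeline's `field_map` below, if your data differs):

```bash
# Hypothetical test document: the inference pipeline below expects a "text" field.
curl -X POST "${ES_URL}/${ES_SOURCE_INDEX_NAME}/_doc" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
  "text": "You can set up a custom extension by uploading an OpenAPI specification."
}'
```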

## Add your embedding model to an inference ingest pipeline

Create an ingest pipeline using the command below:

```bash
curl -X PUT "${ES_URL}/_ingest/pipeline/${ES_PIPELINE_NAME}" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "intfloat__multilingual-e5-small",
        "target_field": "text_embedding",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to '\''failed-<index>'\''",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}'
```

You can verify that the ingest pipeline was created by locating it in the list of your ingest pipelines in Kibana at http://localhost:5601/app/management/ingest/ingest_pipelines
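You can also check from the command line; this request returns the pipeline definition if it was created successfully:

```bash
# Returns the pipeline definition if it exists.
curl -X GET "${ES_URL}/_ingest/pipeline/${ES_PIPELINE_NAME}" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}"
```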

## Create a mapping for the destination index containing the embeddings

Run the command below to create the destination index, referenced by `ES_EMBEDDING_INDEX_NAME`, with the required mappings:

```bash
curl -X PUT "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert $ES_CACERT \
  -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}'
```

* `text_embedding.predicted_value` is the field where the ingest processor stores the embeddings.
* `dims` is the embedding size of the deployed model, which is 384 for the `intfloat/multilingual-e5-small` model used here.
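To confirm the index was created with the expected mapping, you can retrieve it; the response should list `text_embedding.predicted_value` as a `dense_vector` field with 384 dimensions:

```bash
# Verify the dense_vector mapping of the destination index.
curl -X GET "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_mapping" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}"
```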
## Create the text embeddings

Reindex the data from `ES_SOURCE_INDEX_NAME` into the `ES_EMBEDDING_INDEX_NAME` index, running it through the ingest pipeline:

```bash
curl -X POST "${ES_URL}/_reindex?wait_for_completion=false" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "$ES_CACERT" \
  -H 'Content-Type: application/json' -d "{
  \"source\": {
    \"index\": \"${ES_SOURCE_INDEX_NAME}\",
    \"size\": 50
  },
  \"dest\": {
    \"index\": \"${ES_EMBEDDING_INDEX_NAME}\",
    \"pipeline\": \"${ES_PIPELINE_NAME}\"
  }
}"
```

Because `wait_for_completion=false` is set, this command returns a task id that looks like this:

```json
{"task":<task-id>}
```

The reindexing process can take around 10 minutes. You can use the task id returned above to check the status of the process:

```bash
curl -X GET "${ES_URL}/_tasks/<task-id>" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert $ES_CACERT
```

You can check the completion status by monitoring the `"completed"` field in the response:

```json
{
  "completed": true,
  ...
}
```

Once the process is completed, you should see `ES_EMBEDDING_INDEX_NAME` in the list of your indices at http://localhost:5601/app/enterprise_search/content/search_indices

You can confirm the successful completion of this step by checking the `ES_EMBEDDING_INDEX_NAME` index. If the `text_embedding` field is populated with embedding vectors as shown below, the process was successful:

```json
{
  "predicted_value": [
    -0.016909973695874214,
    -0.05246243625879288,
    -0.02864678204059601,
    ...
  ],
  "model_id": "intfloat__multilingual-e5-small"
}
```
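If you prefer the command line over Kibana, fetching a single document from the embedding index is a quick way to perform the same check:

```bash
# Retrieve one enriched document; its _source should include the text_embedding field.
curl -X GET "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_search?size=1" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}"
```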
## Run semantic search

After the dataset has been enriched with vector embeddings, you can query the data using semantic search.

```bash
curl -X GET "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_search" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert $ES_CACERT \
  -H 'Content-Type: application/json' -d '{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "intfloat__multilingual-e5-small",
        "model_text": "how to set up custom extension?"
      }
    },
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}'
```
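The response contains the top `k` hits ranked by vector similarity, restricted to the requested `_source` fields. Trimmed down, a hit looks roughly like the illustrative (not actual) example below:

```json
{
  "hits": {
    "hits": [
      {
        "_index": "<your-embedding-index>",
        "_id": "abc123",
        "_score": 0.94,
        "_source": {
          "id": "doc-1",
          "text": "You can set up a custom extension by uploading an OpenAPI specification."
        }
      }
    ]
  }
}
```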

integrations/extensions/starter-kits/elasticsearch/README.md

Lines changed: 35 additions & 11 deletions
@@ -52,20 +52,44 @@ but you will need to find set-up instructions appropriate to that environment.
(In this change, the existing ELSER v2 example block is re-indented and a new dense vector search bullet is added. The affected section of the README now reads as follows.)

  NOTE: Learn more about ELSER v1 from [here](https://www.elastic.co/guide/en/elasticsearch/reference/8.10/semantic-search-elser.html)
* Semantic search with ELSER v2
  ```json
  {
    "text_expansion": {
      "content_embedding": {
        "model_id": ".elser_model_2",
        "model_text": "how to set up a custom extension?"
      }
    }
  }
  ```
  NOTE: Learn more about ELSER v2 from [here](https://www.elastic.co/guide/en/elasticsearch/reference/8.11/semantic-search-elser.html).
  ELSER v2 is only available for the 8.11 version of Elasticsearch

* To use dense vector search (k-nearest neighbours search), you can set `knn_body` as a session variable and set the `knn` variable to `knn_body` as shown below:

  Here is an example knn body you can use when setting up the `knn_body` session variable:

  ```json
  {
    "field": "text_embedding.predicted_value",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "intfloat__multilingual-e5-small",
        "model_text": "how to set up custom extension?"
      }
    },
    "k": 10,
    "num_candidates": 100
  }
  ```

  NOTE: `intfloat__multilingual-e5-small` is a multilingual embedding model supported by Elasticsearch. If you have Elasticsearch version 8.11 or earlier, you will need to deploy this model to your Elasticsearch cluster first before starting to use it by following the instructions [here](/integrations/extensions/docs/elasticsearch-install-and-setup/text_embedding_deploy_and_use.md).

  <img src="assets/use_elasticsearch_custom_extension_knn.png" width="669" height="627" />

* Compound search
  You can combine different types of queries in a compound query. Learn more about it from this [Elasticsearch tutorial](https://www.elastic.co/guide/en/elasticsearch/reference/8.10/semantic-search-elser.html#text-expansion-compound-query).

* Try typing in anything in your preview chat to trigger `No action matches` action.
  If you see a successful extension call with valid response in the Extension Inspector, your Elasticsearch custom extension has been set up successfully.

[Binary file (image): 70.9 KB]

integrations/extensions/starter-kits/elasticsearch/elasticsearch-generic-openapi.json

Lines changed: 14 additions & 0 deletions
@@ -55,6 +55,9 @@
         },
         "query": {
           "type": "object"
+        },
+        "knn": {
+          "type": "object"
         }
       }
     },
@@ -66,6 +69,17 @@
           "model_text": "tell me about a custom extension"
         }
       }
+    },
+    "knn": {
+      "field": "text_embedding.predicted_value",
+      "query_vector_builder": {
+        "text_embedding": {
+          "model_id": "intfloat__multilingual-e5-small",
+          "model_text": "how to set up custom extension?"
+        }
+      },
+      "k": 10,
+      "num_candidates": 100
     }
   }
 }
