# How to set up and use 3rd-party text embeddings for dense vector search in Elasticsearch
This guide demonstrates how to deploy and use a text embedding model in Elasticsearch. The model will generate vector representations for text, enabling vector similarity (k-nearest neighbours) search.

## Set up Elasticsearch

### Elasticsearch from IBM Cloud
If you are using Elasticsearch from IBM Cloud, please refer to [this guide](./ICD_Elasticsearch_install_and_setup.md) first to create an Elasticsearch instance and set up Kibana if you haven't already.

### Elasticsearch on CloudPak
Alternatively, if you want to install Elasticsearch on Kubernetes (ECK) in CloudPak, you need to follow [this guide](./watsonx_discovery_install_and_setup.md) first to set up Elasticsearch and Kibana. You can skip [Enable ELSER model (v2)](./watsonx_discovery_install_and_setup.md#enable-elser-model-v2) and any section beyond it in that guide.

## Install the eland library
Run the command below to install the [eland](https://github.com/elastic/eland) library.
```bash
python -m pip install "eland[pytorch]"
```
This library allows us to pull and deploy a 3rd-party text embedding model to our Elasticsearch instance.

> CAUTION: Open-source and 3rd-party models are not in scope of IBM or Elastic indemnity clauses. Customers must accept the relevant terms and conditions when choosing or bringing their own models. Additionally, IBM has not assessed Elastic's supported multilingual models, so any use of Elastic-supported models should be thoroughly understood, both with respect to the terms of use for those models and the terms of use of all of the data used to train them.

NOTE: As of the time this documentation was written, `eland` only supports Python 3.8, 3.9, and 3.10. Please refer to the eland library's [compatibility section](https://github.com/elastic/eland?tab=readme-ov-file#compatibility) to make sure you're using compatible Python and Elasticsearch versions.

NOTE: If you run into issues installing the library, you can also use eland without installing it by using the Docker image provided [here](https://github.com/elastic/eland?tab=readme-ov-file#docker).

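For reference, running the importer from a container looks roughly like the sketch below. This is a sketch based on Elastic's published eland image; the exact image name and tag may differ for your setup, so follow the linked instructions:
```bash
# Run the eland importer from a prebuilt image instead of a local install.
# --network host lets the container reach an Elasticsearch instance on this machine.
docker run -it --rm --network host \
    docker.elastic.co/eland/eland \
    eland_import_hub_model --help
```
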
## Create environment variables for ES credentials
Feel free to customize the names of the `ES_SOURCE_INDEX_NAME`, `ES_EMBEDDING_INDEX_NAME`, and `ES_PIPELINE_NAME` variables below. These names will serve as references for your source index, embedding index, and ingest pipeline throughout this guide.
```bash
export ES_URL=https://<hostname:port>
export ES_USER=<username>
export ES_PASSWORD=<password>
export ES_CACERT=<path-to-your-cert>
export ES_SOURCE_INDEX_NAME=<name-of-source-index>
export ES_EMBEDDING_INDEX_NAME=<name-of-embedding-index>
export ES_PIPELINE_NAME=<name-of-ingest-pipeline>
```
You can find these credentials in the service credentials of your Elasticsearch instance.
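
Before moving on, you can sanity-check the credentials and certificate by hitting the cluster root endpoint:
```bash
# Should return a small JSON document with the cluster name and Elasticsearch version.
curl -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" "${ES_URL}"
```
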
## Pull and deploy an embedding model
Run the command below to pull your desired model from the [Hugging Face Model Hub](https://huggingface.co/models) and deploy it on your Elasticsearch instance:
```bash
eland_import_hub_model \
  --url $ES_URL \
  -u $ES_USER -p $ES_PASSWORD --insecure \
  --hub-model-id intfloat/multilingual-e5-small \
  --task-type text_embedding \
  --start
```
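
Note that `--insecure` disables TLS certificate verification for the import. If you would rather keep verification on, recent eland versions can be pointed at a CA bundle instead; the sketch below assumes your version supports the `--ca-certs` option (check `eland_import_hub_model --help`):
```bash
eland_import_hub_model \
  --url $ES_URL \
  -u $ES_USER -p $ES_PASSWORD \
  --ca-certs $ES_CACERT \
  --hub-model-id intfloat/multilingual-e5-small \
  --task-type text_embedding \
  --start
```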

In this example, we are using the `multilingual-e5-small` model, which supports text embeddings in 100 languages. You can read more about this model [here](https://huggingface.co/intfloat/multilingual-e5-small).

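You can also confirm the deployment from the command line with the trained models stats API:
```bash
# The response should list the model with a started deployment.
curl -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  "${ES_URL}/_ml/trained_models/intfloat__multilingual-e5-small/_stats"
```
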
## Synchronize your deployed model
Go to the **Machine Learning > Trained Models** page (http://localhost:5601/app/ml/trained_models) and synchronize your trained models. A warning message displayed at the top of the page says "ML job and trained model synchronization required". Follow the link to "Synchronize your jobs and trained models", then click **Synchronize**.

<img src="assets/synchronize_trained_model.png"/>

Once you have synchronized your model, you should see your deployed model on the **Machine Learning > Model Management** page in Kibana.

## Test your deployed model
Run the command below to test the model using the `_infer` API:
```bash
curl -X POST "${ES_URL}/_ml/trained_models/intfloat__multilingual-e5-small/_infer" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H "Content-Type: application/json" -d '{
  "docs": [
    { "text_field": "how to set up custom extension?" }
  ]
}'
```
You should see a response containing the predicted embedding vector.

```json
{
  "inference_results": [
    {
      "predicted_value": [
        0.016921168193221092,
        -0.035475824028253555,
        -0.0497407428920269,
        ...
```

## Load sample data
Refer to the [Load data into Elasticsearch](./ICD_Elasticsearch_install_and_setup.md#load-data-into-elasticsearch) section in the Elasticsearch setup guide to upload sample data to Elasticsearch using Kibana.

## Add your embedding model to an inference ingest pipeline
Create an ingest pipeline using the command below:
```bash
curl -X PUT "${ES_URL}/_ingest/pipeline/${ES_PIPELINE_NAME}" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
    "description": "Text embedding pipeline",
    "processors": [
      {
        "inference": {
          "model_id": "intfloat__multilingual-e5-small",
          "target_field": "text_embedding",
          "field_map": {
            "text": "text_field"
          }
        }
      }
    ],
    "on_failure": [
      {
        "set": {
          "description": "Index document to '\''failed-<index>'\''",
          "field": "_index",
          "value": "failed-{{{_index}}}"
        }
      },
      {
        "set": {
          "description": "Set error message",
          "field": "ingest.failure",
          "value": "{{_ingest.on_failure_message}}"
        }
      }
    ]
}'
```

The `field_map` tells the inference processor to read each document's `text` field and pass it to the model as `text_field`, the input field the model expects. The `on_failure` handlers route documents that fail inference to a `failed-<index>` index along with the error message.

You can verify that the ingest pipeline was created by locating it in the list of your ingest pipelines in Kibana (http://localhost:5601/app/management/ingest/ingest_pipelines).

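You can also dry-run the pipeline with the simulate API before reindexing any data. This assumes your documents carry a `text` field, as in this guide:
```bash
# Returns the transformed document, including the generated text_embedding field.
curl -X POST "${ES_URL}/_ingest/pipeline/${ES_PIPELINE_NAME}/_simulate" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
  "docs": [
    { "_source": { "text": "how to set up custom extension?" } }
  ]
}'
```
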
## Create a mapping for the destination index containing the embeddings
Run the command below to create the destination index, `ES_EMBEDDING_INDEX_NAME`, with the appropriate mappings:
```bash
curl -X PUT "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}'
```

* `text_embedding.predicted_value` is the field where the ingest processor stores the embeddings.
* `dims` is the embedding size of the deployed model, which is 384 for the `intfloat/multilingual-e5-small` model we are using here.

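To double-check that the index was created as intended, you can retrieve its mapping:
```bash
# The text_embedding.predicted_value field should appear as a dense_vector
# with 384 dims and cosine similarity.
curl -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_mapping"
```
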
## Create the text embeddings
Run the ingest pipeline to reindex the data into the `ES_EMBEDDING_INDEX_NAME` index:
```bash
curl -X POST "${ES_URL}/_reindex?wait_for_completion=false" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d "{
  \"source\": {
    \"index\": \"${ES_SOURCE_INDEX_NAME}\",
    \"size\": 50
  },
  \"dest\": {
    \"index\": \"${ES_EMBEDDING_INDEX_NAME}\",
    \"pipeline\": \"${ES_PIPELINE_NAME}\"
  }
}"
```

Here, `source.size` sets the reindex batch size. Because `wait_for_completion=false` is set, the command runs asynchronously and returns a task ID that looks like this:
```json
{"task":<task-id>}
```

The reindexing process can take around 10 minutes, depending on the size of your data. You can use the task ID returned above to check the status of the process:

```bash
curl -X GET "${ES_URL}/_tasks/<task-id>" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}"
```

You can check the completion status by monitoring the `"completed"` field in the response:

```json
{
  "completed": true,
  ...
}
```

Once the process is completed, you should see `ES_EMBEDDING_INDEX_NAME` in the list of your indices (http://localhost:5601/app/enterprise_search/content/search_indices).

You can confirm the successful completion of this step by checking the `ES_EMBEDDING_INDEX_NAME` index. If you find the `text_embedding` field of the documents filled with embedding vectors as shown below, the process was successful:
```json
{
  "predicted_value": [
    -0.016909973695874214,
    -0.05246243625879288,
    -0.02864678204059601,
    ...
  ],
  "model_id": "intfloat__multilingual-e5-small"
}
```
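
As an additional check, you can compare document counts between the source and embedding indices; barring failed documents (which are routed to a `failed-<index>` index by the pipeline's `on_failure` handlers), the two counts should match:
```bash
curl -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" "${ES_URL}/${ES_SOURCE_INDEX_NAME}/_count"
curl -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_count"
```
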
## Run semantic search
After the dataset has been enriched with vector embeddings, you can query the data using semantic search:
```bash
curl -X GET "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_search" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "intfloat__multilingual-e5-small",
        "model_text": "how to set up custom extension?"
      }
    },
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}'
```
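
Here, `k` is the number of nearest neighbours to return and `num_candidates` is the number of candidates examined per shard; raising `num_candidates` improves recall at the cost of latency. The kNN clause also accepts a `filter` to restrict the vector search to matching documents. A sketch, assuming a hypothetical `title` keyword field in your data:
```bash
curl -X GET "${ES_URL}/${ES_EMBEDDING_INDEX_NAME}/_search" \
  -u "${ES_USER}:${ES_PASSWORD}" --cacert "${ES_CACERT}" \
  -H 'Content-Type: application/json' -d '{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "intfloat__multilingual-e5-small",
        "model_text": "how to set up custom extension?"
      }
    },
    "k": 10,
    "num_candidates": 100,
    "filter": {
      "term": { "title": "custom extension" }
    }
  },
  "_source": ["id", "text"]
}'
```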