This script removes duplicates from an Elasticsearch index based on a set of fields.
- Python 3.x
- Elasticsearch 7.x (or compatible version)
python3 deduplicate_elasticsearch.py [-h] -i INDEX_NAME -k KEYS [-e ES_HOST] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
-h, --help: show help message and exit-i INDEX_NAME, --index-name INDEX_NAME: name of the Elasticsearch index-k KEYS, --keys KEYS: comma-separated list of fields to use for determining duplicates-e ES_HOST, --es-host ES_HOST: Elasticsearch host (default: http://localhost:9200)-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}: logging level (default: INFO)
python3 deduplicate_elasticsearch.py -i my_index -k first_name,last_name -e http://localhost:9200 -l DEBUG
This will remove duplicates from the my_index Elasticsearch index based on the first_name and last_name fields. The Elasticsearch server is assumed to be running on http://localhost:9200. The script will output debug messages to the console.
This project is licensed under the MIT License. See the LICENSE file for details.