
Commit d96c303

Replaced specifying options on the command line with a YAML config file
1 parent 166cd75 commit d96c303

File tree: 3 files changed (+59 −23 lines)


utilities/Hive_metastore_migration/README.md

Lines changed: 13 additions & 22 deletions
@@ -186,35 +186,26 @@ as an Glue ETL job, if AWS Glue can directly connect to your Hive metastore.
 2. Submit the `hive_metastore_migration.py` Spark script to your Spark cluster
    using the following parameters:

-   - Set `--direction` to `from_metastore`, or omit the argument since
-     `from_metastore` is the default.
+   - Set `--config_file` to `<path_to_your_config_yaml_file>` (default path: `artifacts/config.yaml`)
+
+   - Provide the following configuration parameters in the configuration yaml file:
+     ```
+     * mode
+     * jdbc-url
+     * jdbc-username
+     * jdbc-password
+     * database-prefix
+     * table-prefix
+     ```

-   - Provide the JDBC connection information through these arguments:
-     `--jdbc-url`, `--jdbc-username`, and `--jdbc-password`.
-
-   - The argument `--output-path` is required. It is either a local file system location
-     or an S3 location. If the output path is a local directory, you can upload the data
-     to an S3 location manually. If it is an S3 path, you need to make sure that the Spark
-     cluster has EMRFS library in its class path. The script will export the metadata to a
-     subdirectory of the output-path you provided.
-
-   - `--database-prefix` and `--table-prefix` (optional) to set a string prefix that is applied to the
-     database and table names. They are empty by default.
-
   - Example spark-submit command to migrate Hive metastore to S3, tested on EMR-4.7.1:
-    ```bash
+     ```bash
     MYSQL_JAR_PATH=/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar
     DRIVER_CLASSPATH=/home/hadoop/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:$MYSQL_JAR_PATH
     spark-submit --driver-class-path $DRIVER_CLASSPATH \
       --jars $MYSQL_JAR_PATH \
       /home/hadoop/hive_metastore_migration.py \
-      --mode from-metastore \
-      --jdbc-url jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306 \
-      --jdbc-user hive \
-      --jdbc-password myJDBCPassword \
-      --database-prefix myHiveMetastore_ \
-      --table-prefix myHiveMetastore_ \
-      --output-path s3://mybucket/myfolder/
+      --config_file artifacts/config.yaml
     ```

 - If the job finishes successfully, it creates 3 sub-folders in the S3 output path you
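
For reference, a `from-metastore` run driven by the new config file could look like the sketch below, filled in with the same illustrative values as the spark-submit flags this commit removes (the JDBC endpoint, credentials, prefixes, and S3 bucket are the README's placeholders, not real values). A `from-metastore` run also needs an output path set, and a `to-metastore` run needs an input path instead, per the validation added in the script.

```yaml
# Hypothetical artifacts/config.yaml for a from-metastore migration;
# every value below is a placeholder carried over from the old spark-submit example.
mode: from-metastore
jdbc-url: jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306
jdbc-username: hive
jdbc-password: myJDBCPassword
database-prefix: myHiveMetastore_
table-prefix: myHiveMetastore_
output-path: s3://mybucket/myfolder/
```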
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+mode:
+jdbc-url:
+jdbc-username:
+jdbc-password:
+database-prefix:
+table-prefix:
+output-path:
+input_path:
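
Loading this blank template gives the script a dictionary whose values are all `None`, since YAML parses an empty value as null. A minimal sketch of what `yaml.safe_load` returns, assuming PyYAML is installed and the template sits at `artifacts/config.yaml`:

```python
# Sketch only: load the blank template and inspect what the migration script would see.
import yaml

with open('artifacts/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Every key left blank in the template comes back as None, so each option the
# chosen mode needs must be filled in before running the migration.
print(config)
# {'mode': None, 'jdbc-url': None, 'jdbc-username': None, 'jdbc-password': None,
#  'database-prefix': None, 'table-prefix': None, 'output-path': None, 'input_path': None}
```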

utilities/Hive_metastore_migration/src/hive_metastore_migration.py

Lines changed: 38 additions & 1 deletion
@@ -14,6 +14,7 @@
 # except for python 2.7 standard library and Spark 2.1
 import sys
 import argparse
+import yaml
 import re
 import logging
 from time import localtime, strftime
@@ -1398,6 +1399,39 @@ def parse_arguments(args):
     return options


+def parse_arguments_from_yaml_file(args):
+    """
+    This function accepts the path to a config file
+    and extracts the needed arguments for the metastore migration.
+    ----------
+    Return:
+        Dictionary of config options
+    """
+    parser = argparse.ArgumentParser(prog=args[0])
+    parser.add_argument('-f', '--config_file', required=False, default='artifacts/config.yaml', help='Path of the yaml configuration file to read migration arguments from. Default path: artifacts/config.yaml')
+    options = get_options(parser, args)
+    config_file_path = options['config_file']
+    # read the yaml file
+    with open(config_file_path, 'r') as yaml_file_stream:
+        config_options = yaml.safe_load(yaml_file_stream)
+
+    if config_options['mode'] == FROM_METASTORE:
+        validate_options_in_mode(
+            options=config_options, mode=FROM_METASTORE,
+            required_options=['output_path'],
+            not_allowed_options=['input_path']
+        )
+    elif config_options['mode'] == TO_METASTORE:
+        validate_options_in_mode(
+            options=config_options, mode=TO_METASTORE,
+            required_options=['input_path'],
+            not_allowed_options=['output_path']
+        )
+    else:
+        raise AssertionError('unknown mode ' + config_options['mode'])
+
+    return config_options
+
 def get_spark_env():
     conf = SparkConf()
     sc = SparkContext(conf=conf)
@@ -1501,7 +1535,10 @@ def validate_aws_regions(region):


 def main():
-    options = parse_arguments(sys.argv)
+    # options = parse_arguments(sys.argv)
+
+    # options are now read from the yaml config file passed via --config_file
+    options = parse_arguments_from_yaml_file(sys.argv)

     connection = {
         'url': options['jdbc_url'],
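
One follow-up worth noting: the template keys are mostly hyphenated (`jdbc-url`, `output-path`), while the validation above and `main()` index the options with underscores (`output_path`, `jdbc_url`). A small, hypothetical normalization step right after loading the YAML would reconcile the two; the helper name below is an illustration, not part of this commit:

```python
def normalize_config_keys(config_options):
    """Hypothetical helper: rewrite hyphenated YAML keys (e.g. 'jdbc-url') to
    the underscore form (e.g. 'jdbc_url') that the rest of the script uses
    when it indexes the options dictionary."""
    return {key.replace('-', '_'): value for key, value in config_options.items()}

# e.g. in parse_arguments_from_yaml_file, right after the YAML is loaded:
#     config_options = normalize_config_keys(config_options)
```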
