
Commit ea1aafb

resolved conflicts for aws-samples#13
2 parents: 5b61691 + d96c303

3 files changed: +59, -23 lines

utilities/Hive_metastore_migration/README.md

13 additions, 22 deletions

@@ -191,35 +191,26 @@ as an Glue ETL job, if AWS Glue can directly connect to your Hive metastore.
 2. Submit the `hive_metastore_migration.py` Spark script to your Spark cluster
    using the following parameters:

-   - Set `--direction` to `from_metastore`, or omit the argument since
-     `from_metastore` is the default.
+   - Set `--config_file` to `<path_to_your_config_yaml_file>` (default path: `artifacts/config.yaml`)
+
+   - Provide the following configuration parameters in the configuration yaml file:
+     ```
+     * mode
+     * jdbc-url
+     * jdbc-username
+     * jdbc-password
+     * database-prefix
+     * table-prefix
+     ```

-   - Provide the JDBC connection information through these arguments:
-     `--jdbc-url`, `--jdbc-username`, and `--jdbc-password`.
-
-   - The argument `--output-path` is required. It is either a local file system location
-     or an S3 location. If the output path is a local directory, you can upload the data
-     to an S3 location manually. If it is an S3 path, you need to make sure that the Spark
-     cluster has EMRFS library in its class path. The script will export the metadata to a
-     subdirectory of the output-path you provided.
-
-   - `--database-prefix` and `--table-prefix` (optional) to set a string prefix that is applied to the
-     database and table names. They are empty by default.
-
   - Example spark-submit command to migrate Hive metastore to S3, tested on EMR-4.7.1:
-     ```bash
+     ```bash
     MYSQL_JAR_PATH=/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar
     DRIVER_CLASSPATH=/home/hadoop/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:$MYSQL_JAR_PATH
     spark-submit --driver-class-path $DRIVER_CLASSPATH \
       --jars $MYSQL_JAR_PATH \
       /home/hadoop/hive_metastore_migration.py \
-      --mode from-metastore \
-      --jdbc-url jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306 \
-      --jdbc-user hive \
-      --jdbc-password myJDBCPassword \
-      --database-prefix myHiveMetastore_ \
-      --table-prefix myHiveMetastore_ \
-      --output-path s3://mybucket/myfolder/
+      --config_file artifacts/config.yaml
     ```

 - If the job finishes successfully, it creates 3 sub-folders in the S3 output path you
8 additions, 0 deletions

@@ -0,0 +1,8 @@
+mode:
+jdbc-url:
+jdbc-username:
+jdbc-password:
+database-prefix:
+table-prefix:
+output-path:
+input_path:
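For reference, here is a minimal sketch of what this template might look like once filled in, and how it can be read with PyYAML. The values are lifted from the spark-submit example that was removed from the README; they are placeholders, not part of this commit, and `yaml.safe_load` is shown simply as the loader that restricts YAML to plain data types.

```python
import yaml

# Hypothetical filled-in config for a from-metastore export; the values mirror the
# old README example and are placeholders only.
EXAMPLE_CONFIG = """
mode: from-metastore
jdbc-url: jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306
jdbc-username: hive
jdbc-password: myJDBCPassword
database-prefix: myHiveMetastore_
table-prefix: myHiveMetastore_
output-path: s3://mybucket/myfolder/
"""

# safe_load parses the YAML into a plain dict without constructing arbitrary objects.
options = yaml.safe_load(EXAMPLE_CONFIG)
print(options["mode"])      # -> 'from-metastore'
print(options["jdbc-url"])  # -> 'jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306'
```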

utilities/Hive_metastore_migration/src/hive_metastore_migration.py

38 additions, 1 deletion

@@ -9,6 +9,7 @@
 # except for python 2.7 standard library and Spark 2.1
 import sys
 from datetime import datetime, timedelta, tzinfo
+import yaml
 from time import localtime, strftime
 from types import MethodType

@@ -1606,6 +1607,39 @@ def parse_arguments(args):
     return options


+def parse_arguments_from_yaml_file(args):
+    """
+    This function accepts the path to a config file
+    and extracts the needed arguments for the metastore migration
+    ----------
+    Return:
+    Dictionary of config options
+    """
+    parser = argparse.ArgumentParser(prog=args[0])
+    parser.add_argument('-f', '--config_file', required=True, default='artifacts/config.yaml`', help='Provide yaml configuration file path to read migration arguments from. Default path: `artifacts/config.yaml`')
+    options = get_options(parser, args)
+    config_file_path = options['config_file']
+    ## read the yaml file
+    with open(config_file_path, 'r') as yaml_file_stream:
+        config_options = yaml.load(yaml_file_stream)
+
+    if config_options['mode'] == FROM_METASTORE:
+        validate_options_in_mode(
+            options=config_options, mode=FROM_METASTORE,
+            required_options=['output_path'],
+            not_allowed_options=['input_path']
+        )
+    elif config_options['mode'] == TO_METASTORE:
+        validate_options_in_mode(
+            options=config_options, mode=TO_METASTORE,
+            required_options=['input_path'],
+            not_allowed_options=['output_path']
+        )
+    else:
+        raise AssertionError('unknown mode ' + options['mode'])
+
+    return config_options
+
 def get_spark_env():
     try:
         sc = SparkContext.getOrCreate()

@@ -1733,7 +1767,10 @@ def validate_aws_regions(region):


 def main():
-    options = parse_arguments(sys.argv)
+    # options = parse_arguments(sys.argv)
+
+    ## This now reads options from path to config yaml file
+    options = parse_arguments_from_yaml_file(sys.argv)

     connection = {"url": options["jdbc_url"], "user": options["jdbc_username"], "password": options["jdbc_password"]}
     db_prefix = options.get("database_prefix") or ""
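To tie the pieces together, here is a minimal sketch of the code path `main()` now follows: read the config file, then build the JDBC connection dictionary. The path and the underscore-style keys (`jdbc_url`, `jdbc_username`, ...) are assumptions made to match the lookups shown in `main()`, and `yaml.safe_load` is used instead of the bare `yaml.load` call in the commit, since recent PyYAML releases expect an explicit Loader argument for the latter.

```python
import yaml

# Illustrative path only; the script's --config_file default points at artifacts/config.yaml.
CONFIG_PATH = 'artifacts/config.yaml'

with open(CONFIG_PATH, 'r') as yaml_file_stream:
    # safe_load needs no Loader argument and limits the YAML to plain data types.
    options = yaml.safe_load(yaml_file_stream)

# Build the JDBC connection dict the same way main() does; the key names are assumed
# to use underscores here so they line up with the lookups shown in the diff above.
connection = {
    'url': options['jdbc_url'],
    'user': options['jdbc_username'],
    'password': options['jdbc_password'],
}
db_prefix = options.get('database_prefix') or ''
```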
