
Commit 166cd75

Authored and committed by dichenli

1. Add region argument.
2. If sortColumn or PartitionKey fields are None, convert them to empty lists.
3. Fix bug: None timestamp in CreateTime causes export_to_metastore failure.
4. Avoid None database_prefix and table_prefix.
5. Readme updates and fixes.
6. Code style improvements.
1 parent 05953fe commit 166cd75

File tree: 4 files changed (+200, -103 lines)

utilities/Hive_metastore_migration/README.md

Lines changed: 28 additions & 10 deletions
@@ -134,6 +134,10 @@ Below are instructions for using each of the migration workflows described above
   you created to point to the Hive metastore. It is used to extract the Hive JDBC
   connection information using the native Spark library.
+- `--region` the AWS region for Glue Data Catalog, for example, `us-east-1`.
+  You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region.
+  If not provided, `us-east-1` is used as default.
+
 - `--database-prefix` (optional) set to a string prefix that is applied to the
   database name created in AWS Glue Data Catalog. You can use it as a way
   to track the origin of the metadata, and avoid naming conflicts. The default
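As a quick illustration of how these flags reach the job, the sketch below starts the migration job with boto3 and passes the new `--region` flag together with an optional `--database-prefix`. The job name, prefix, and region values are placeholders rather than anything defined in this repository, and the full argument set depends on which migration mode you run.

```python
import boto3

# Sketch only: job name, prefix, and region are placeholders for your own setup.
glue = boto3.client('glue', region_name='us-east-1')
glue.start_job_run(
    JobName='hive-metastore-to-glue-migration',   # hypothetical job name
    Arguments={
        '--region': 'us-east-1',            # Glue Data Catalog region
        '--database-prefix': 'hive_prod_',  # optional placeholder prefix
    },
)
```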
@@ -164,7 +168,8 @@ If the above solutions don't apply to your situation, you can choose to first
 migrate your Hive metastore to Amazon S3 objects as a staging area, then run an ETL
 job to import the metadata from S3 to the AWS Glue Data Catalog. To do this, you need to
 have a Spark 2.1.x cluster that can connect to your Hive metastore and export
-metadata to plain files on S3.
+metadata to plain files on S3. The Hive metastore to S3 migration can also run
+as a Glue ETL job, if AWS Glue can directly connect to your Hive metastore.

 1. Make the MySQL connector jar available to the Spark cluster on the master and
    all worker nodes. Include the jar in the Spark driver class path as well
@@ -229,9 +234,12 @@ metadata to plain files on S3.
 Add the following parameters.

 - `--mode` set to `from-s3`
-- `--database-input-path` set to the S3 path containing only databases.
-- `--table-input-path` set to the S3 path containing only tables.
-- `--partition-input-path` set to the S3 path containing only partitions.
+- `--region` the AWS region for Glue Data Catalog, for example, `us-east-1`.
+  You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region.
+  If not provided, `us-east-1` is used as default.
+- `--database-input-path` set to the S3 path containing only databases. For example: `s3://someBucket/output_path_from_previous_job/databases`
+- `--table-input-path` set to the S3 path containing only tables. For example: `s3://someBucket/output_path_from_previous_job/tables`
+- `--partition-input-path` set to the S3 path containing only partitions. For example: `s3://someBucket/output_path_from_previous_job/partitions`

 Also, because there is no need to connect to any JDBC source, the job doesn't
 require any connections.
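Putting the `from-s3` parameters above together, the job's argument map might look like the sketch below. The bucket and paths reuse the README's placeholder examples and must be replaced with your own export location.

```python
# Sketch of the argument map for the from-s3 import job; the paths below reuse
# the README's placeholder bucket and must point at your own exported metadata.
from_s3_arguments = {
    '--mode': 'from-s3',
    '--region': 'us-east-1',
    '--database-input-path': 's3://someBucket/output_path_from_previous_job/databases',
    '--table-input-path': 's3://someBucket/output_path_from_previous_job/tables',
    '--partition-input-path': 's3://someBucket/output_path_from_previous_job/partitions',
}
```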
@@ -315,6 +323,9 @@ metadata to plain files on S3.
   directly to a jdbc Hive Metastore
 - `--connection-name` set to the name of the AWS Glue connection
   you created to point to the Hive metastore. It is the destination of the migration.
+- `--region` the AWS region for Glue Data Catalog, for example, `us-east-1`.
+  You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region.
+  If not provided, `us-east-1` is used as default.
 - `--database-names` set to a semi-colon(;) separated list of
   database names to export from Data Catalog.

@@ -333,7 +344,10 @@ metadata to plain files on S3.
 instructions above. Since the destination is now an S3 bucket instead of a Hive metastore,
 no connections are required. In the job, add the following parameters:

-- `--mode` set to `to-S3`, which means the migration is to S3.
+- `--mode` set to `to-s3`, which means the migration is to S3.
+- `--region` the AWS region for Glue Data Catalog, for example, `us-east-1`.
+  You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region.
+  If not provided, `us-east-1` is used as default.
 - `--database-names` set to a semi-colon(;) separated list of
   database names to export from Data Catalog.
 - `--output-path` set to the S3 destination path.
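Both export modes take `--database-names` as a single semicolon-separated string, so the script is expected to split the value itself. A small sketch of that parsing, with made-up database names:

```python
# '--database-names' arrives as one string such as 'sales_db;hr_db;inventory_db'
# (placeholder names); splitting on ';' yields the list of databases to export.
database_names = 'sales_db;hr_db;inventory_db'
database_arr = [name for name in database_names.split(';') if name]
print(database_arr)  # ['sales_db', 'hr_db', 'inventory_db']
```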
@@ -365,8 +379,7 @@ metadata to plain files on S3.

 #### AWS Glue Data Catalog to another AWS Glue Data Catalog

-Currently, you cannot access an AWS Glue Data Catalog in another account.
-However, you can migrate (copy) metadata from the Data Catalog in one account to another. The steps are:
+You can migrate (copy) metadata from the Data Catalog in one account to another. The steps are:

 1. Enable cross-account access for an S3 bucket so that both source and target accounts can access it. See
    [the Amazon S3 documentation](http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html#example-bucket-policies-use-case-1)
@@ -379,7 +392,7 @@ However, you can migrate (copy) metadata from the Data Catalog in one account to

 3. Upload the following scripts to an S3 bucket accessible from the target AWS account to be updated:

-   export_from_datacatalog.py
+   import_into_datacatalog.py
    hive_metastore_migration.py

 4. In the source AWS account, create a job on the AWS Glue console to extract metadata from the AWS Glue Data Catalog to S3.
@@ -391,7 +404,10 @@ However, you can migrate (copy) metadata from the Data Catalog in one account to

 Add the following parameters:

-- `--mode` set to `to-S3`, which means the migration is to S3.
+- `--mode` set to `to-s3`, which means the migration is to S3.
+- `--region` the AWS region for Glue Data Catalog, for example, `us-east-1`.
+  You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region.
+  If not provided, `us-east-1` is used as default.
 - `--database-names` set to a semi-colon(;) separated list of
   database names to export from Data Catalog.
 - `--output-path` set to the S3 destination path that you configured with **cross-account access**.
@@ -407,10 +423,12 @@ However, you can migrate (copy) metadata from the Data Catalog in one account to
 Add the following parameters.

 - `--mode` set to `from-s3`
+- `--region` the AWS region for Glue Data Catalog, for example, `us-east-1`.
+  You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region.
+  If not provided, `us-east-1` is used as default.
 - `--database-input-path` set to the S3 path containing only databases.
 - `--table-input-path` set to the S3 path containing only tables.
 - `--partition-input-path` set to the S3 path containing only partitions.

 6. (Optional) Manually delete the temporary files generated in the S3 folder. Also, remember to revoke the
    cross-account access if it's not needed anymore.
-
utilities/Hive_metastore_migration/src/export_from_datacatalog.py

Lines changed: 10 additions & 4 deletions
@@ -18,7 +18,7 @@

 from hive_metastore_migration import *

-CONNECTION_TYPE_NAME = "com.amazonaws.services.glue.connections.DataCatalogConnection"
+CONNECTION_TYPE_NAME = 'com.amazonaws.services.glue.connections.DataCatalogConnection'

 def transform_catalog_to_df(dyf):
     return dyf.toDF()
@@ -50,7 +50,7 @@ def datacatalog_migrate_to_hive_metastore(sc, sql_context, databases, tables, pa
     hive_metastore.export_to_metastore()


-def read_databases_from_catalog(sql_context, glue_context, datacatalog_name, database_arr):
+def read_databases_from_catalog(sql_context, glue_context, datacatalog_name, database_arr, region):
     databases = None
     tables = None
     partitions = None
@@ -59,7 +59,9 @@ def read_databases_from_catalog(sql_context, glue_context, datacatalog_name, dat

         dyf = glue_context.create_dynamic_frame.from_options(
             connection_type=CONNECTION_TYPE_NAME,
-            connection_options={"catalog.name": datacatalog_name, "catalog.database": database})
+            connection_options={'catalog.name': datacatalog_name,
+                                'catalog.database': database,
+                                'catalog.region': region})

         df = transform_catalog_to_df(dyf)

@@ -88,6 +90,7 @@ def main():
     parser.add_argument('--database-names', required=True, help='Semicolon-separated list of names of database in Datacatalog to export')
     parser.add_argument('-o', '--output-path', required=False, help='Output path, either local directory or S3 path')
     parser.add_argument('-c', '--connection-name', required=False, help='Glue Connection name for Hive metastore JDBC connection')
+    parser.add_argument('-R', '--region', required=False, help='AWS region of source Glue DataCatalog, default to "us-east-1"')

     options = get_options(parser, sys.argv)
     if options['mode'] == to_s3:
@@ -105,6 +108,8 @@ def main():
     else:
         raise AssertionError('unknown mode ' + options['mode'])

+    validate_aws_regions(options['region'])
+
     # spark env
     (conf, sc, sql_context) = get_spark_env()
     glue_context = GlueContext(sc)
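`validate_aws_regions` comes from `hive_metastore_migration.py`, which is not shown in this excerpt. A minimal sketch of what such a check might look like, assuming a hard-coded allow-list; the repository's actual implementation and region list may differ.

```python
# Illustrative only -- not the repository's actual implementation.
KNOWN_GLUE_REGIONS = {'us-east-1', 'us-east-2', 'us-west-2', 'eu-west-1', 'ap-northeast-1'}

def validate_aws_regions_sketch(region):
    # A missing --region is acceptable; main() later falls back to 'us-east-1'.
    if region is not None and region not in KNOWN_GLUE_REGIONS:
        raise AssertionError('Unsupported Glue region: %s' % region)
```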
@@ -116,7 +121,8 @@ def main():
         sql_context=sql_context,
         glue_context=glue_context,
         datacatalog_name='datacatalog',
-        database_arr=database_arr
+        database_arr=database_arr,
+        region=options.get('region') or 'us-east-1'
     )

     if options['mode'] == to_s3:
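The `options.get('region') or 'us-east-1'` expression is what makes the flag optional: argparse stores `None` for an omitted `--region`, and `or` then substitutes the default. A tiny illustration of that behavior:

```python
# argparse stores an omitted optional argument as None, so `or` supplies
# the default both when the key maps to None and when the value is empty.
options = {'region': None}
print(options.get('region') or 'us-east-1')  # us-east-1

options = {'region': 'eu-west-1'}
print(options.get('region') or 'us-east-1')  # eu-west-1
```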
