This project showcases my skills in building a complete ETL pipeline using AWS Glue and Amazon S3 to process, clean, and analyze retail data. It demonstrates my ability to work with cloud-based data workflows and deliver analytical insights from raw datasets.
In this project, I:
- Uploaded raw product and transaction data to Amazon S3
- Created AWS Glue classifiers, crawlers, and a Visual ETL job
- Performed an inner join on `Product ID` to combine transaction and product metadata
- Used regex to clean currency values in the `Sales` column
- Applied an aggregate transformation to calculate the average net sales grouped by product category and ship mode
- Stored the final output in Parquet format in an S3 output bucket
- Queried the output using S3 Select to confirm the transformation result
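
The transformation logic described above can be sketched in plain Python to make the join, cleaning, and aggregation steps concrete. This is not the Glue job itself: the sample rows are hypothetical, the regex pattern is an assumption (the exact pattern used in the Regex Extractor is not recorded here), and the helper names `net_sales` and `average_net_sales` are illustrative only. It also assumes each `Product ID` appears once in the product data.

```python
import re
from collections import defaultdict

# Hypothetical sample rows; the real data lives in the two CSV files.
transactions = [
    {"Product ID": "P1", "Ship Mode": "Second Class", "Sales": "$1,200.50"},
    {"Product ID": "P1", "Ship Mode": "Second Class", "Sales": "$799.50"},
    {"Product ID": "P2", "Ship Mode": "First Class", "Sales": "$50.00"},
    {"Product ID": "P9", "Ship Mode": "First Class", "Sales": "$10.00"},  # no product match
]
products = [
    {"Product ID": "P1", "Product Category": "Fashion"},
    {"Product ID": "P2", "Product Category": "Auto"},
]

# Assumed pattern: digits with optional thousands separators and decimals.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def net_sales(raw):
    """Extract the numeric part of a currency string such as '$1,200.50'."""
    match = NUMBER.search(raw)
    if match is None:
        raise ValueError(f"no numeric value in {raw!r}")
    return float(match.group().replace(",", ""))


def average_net_sales(transactions, products):
    """Inner join on Product ID, then average NetSales per (category, ship mode)."""
    by_id = {p["Product ID"]: p for p in products}
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for txn in transactions:
        product = by_id.get(txn["Product ID"])
        if product is None:
            continue  # inner join: drop transactions without a matching product
        key = (product["Product Category"], txn["Ship Mode"])
        totals[key][0] += net_sales(txn["Sales"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


result = average_net_sales(transactions, products)
for (category, ship_mode), average in result.items():
    print(category, ship_mode, average)
```

Each output line has the same shape as the row used to verify the Glue job: category, ship mode, average net sales.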
- Data Upload
  - Uploaded `transactions.csv` to the `transaction-files/` folder in an S3 bucket
  - Uploaded `product details.csv` to the `product-files/` folder in the same bucket
- AWS Glue Setup
  - Created a Glue database named `abc-retail`
  - Created two classifiers for reading CSV files with headers:
    - `txnClass` for `transactions.csv`
    - `cust_classifier` for `product details.csv`
  - Created two crawlers to import both datasets into the Glue Data Catalog
- Visual ETL Job
  - Added Glue Data Catalog nodes for both tables
  - Joined the datasets on `Product ID` using an Inner Join
  - Dropped one duplicate `Product ID` column
  - Used Regex Extractor on the `Sales` column to extract numeric values into a new column called `NetSales`
  - Applied Aggregate transformation:
    - Grouped by: `Product Category`, `Ship Mode`
    - Aggregated field: `NetSales`
    - Aggregation function: Average
  - Saved output to another S3 bucket in Parquet format with Snappy compression
- Result Verification
  - Queried the `.parquet` output file using S3 Select with SQL: `SELECT * FROM s3object`
  - Confirmed correct aggregation with output: `Fashion Second Class 1274.702381`
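
The same verification query could also be run from code instead of the console. The sketch below assumes `boto3` is installed and AWS credentials are configured. The bucket name and object key are hypothetical placeholders, and the helpers `build_select_request` and `run_select` are illustrative names, not part of the project.

```python
def build_select_request(bucket, key, sql="SELECT * FROM s3object"):
    """Build keyword arguments for an S3 Select query against a Parquet object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(request):
    """Run the query and return the result as CSV text (needs boto3 and credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    response = boto3.client("s3").select_object_content(**request)
    chunks = []
    for event in response["Payload"]:  # streamed events; only Records carry data
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


# Hypothetical bucket and key; substitute the real output location.
request = build_select_request("my-output-bucket", "output/result.snappy.parquet")
# print(run_select(request))  # uncomment once the bucket and key point at real output
```

Requesting CSV output keeps the result easy to compare by eye with the expected row.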
You can find all key screenshots in the project submission. The project files are:
- `transactions.csv`: Raw sales transaction data
- `product details.csv`: Product metadata
- `Retail Data Management Project.pdf`: Project prompt and instructions
This project was completed to demonstrate hands-on skills in AWS Glue ETL pipeline design, data cleaning using regex, aggregation logic, and S3-based querying for analytics use cases.
MIT License — see LICENSE.