🛒🛍️✨ Retail Data Management with AWS Glue

This project showcases my skills in building a complete ETL pipeline using AWS Glue and Amazon S3 to process, clean, and analyze retail data. It demonstrates my ability to work with cloud-based data workflows and deliver analytical insights from raw datasets.

📌 Project Summary

In this project, I:

- Uploaded raw product and transaction data to Amazon S3
- Created AWS Glue classifiers, crawlers, and a Visual ETL job
- Performed an inner join on Product ID to combine transaction and product metadata
- Used regex to clean currency values in the Sales column
- Applied an aggregate transformation to calculate the average net sales grouped by product category and ship mode
- Stored the final output in Parquet format in an S3 output bucket
- Queried the output using S3 Select to confirm the transformation result

✅ Steps Performed

Data Upload
- Uploaded transactions.csv to the transaction-files/ folder in an S3 bucket
- Uploaded product details.csv to the product-files/ folder in the same bucket
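
The upload step can also be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the bucket name `example-retail-raw` is a placeholder, since this README does not state the real one.

```python
from pathlib import PurePosixPath

# Hypothetical bucket name; the README does not state the real one.
BUCKET = "example-retail-raw"

# Local file -> S3 folder, as listed in the Data Upload step.
UPLOADS = {
    "transactions.csv": "transaction-files/",
    "product details.csv": "product-files/",
}


def build_keys(uploads):
    """Return (local_path, s3_key) pairs, keeping the original file names."""
    return [
        (path, prefix + PurePosixPath(path).name)
        for path, prefix in uploads.items()
    ]


def upload_all(bucket=BUCKET, uploads=UPLOADS):
    """Upload each file to S3. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so build_keys works without boto3

    s3 = boto3.client("s3")
    for local_path, key in build_keys(uploads):
        s3.upload_file(local_path, bucket, key)
```

Calling `upload_all()` performs the actual transfer; `build_keys(UPLOADS)` alone shows which object keys would be written.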
AWS Glue Setup
- Created a Glue database named abc-retail
- Created two classifiers for reading CSV files with headers:
  - txnClass for transactions.csv
  - cust_classifier for product details.csv
- Created two crawlers to import both datasets into the Glue Data Catalog
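
For reference, roughly the same setup could be created with boto3 instead of the console. This is a sketch under assumptions: the bucket name, IAM role, and crawler names below are hypothetical, and only the database and classifier names come from this README.

```python
# Hypothetical values: the README names the database and classifiers,
# but not the bucket, the crawler names, or the IAM role.
BUCKET = "example-retail-raw"
ROLE = "ExampleGlueCrawlerRole"
DATABASE = "abc-retail"


def classifier_request(name):
    """Arguments for a CSV classifier that treats the first row as a header."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": ",",
            "ContainsHeader": "PRESENT",
        }
    }


def crawler_request(name, classifier, prefix):
    """Arguments for a crawler that scans one S3 folder into the database."""
    return {
        "Name": name,
        "Role": ROLE,
        "DatabaseName": DATABASE,
        "Classifiers": [classifier],
        "Targets": {"S3Targets": [{"Path": f"s3://{BUCKET}/{prefix}"}]},
    }


def create_resources():
    """Create the database, classifiers, and crawlers, then start the crawlers."""
    import boto3  # needs AWS credentials with Glue permissions

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": DATABASE})
    for crawler, classifier, prefix in [
        ("transactions-crawler", "txnClass", "transaction-files/"),
        ("products-crawler", "cust_classifier", "product-files/"),
    ]:
        glue.create_classifier(**classifier_request(classifier))
        glue.create_crawler(**crawler_request(crawler, classifier, prefix))
        glue.start_crawler(Name=crawler)
```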
Visual ETL Job
- Added Glue Data Catalog nodes for both tables
- Joined the datasets on Product ID using an Inner Join
- Dropped one duplicate Product ID column
- Used the Regex Extractor transform on the Sales column to extract numeric values into a new column called NetSales
- Applied Aggregate transformation:
  - Grouped by: Product Category, Ship Mode
  - Aggregated field: NetSales
  - Aggregation function: Average
- Saved output to another S3 bucket in Parquet format with Snappy compression
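
The join, regex, and aggregate logic of the Visual ETL job can be illustrated in plain Python. This is only a sketch of the same steps, not the job Glue generates: the regex pattern is an assumption (the actual one is not shown in this README), the split of columns between the two tables is assumed, and the sample rows are made up.

```python
import re
from collections import defaultdict

# Assumed pattern: first run of digits with an optional decimal part.
# The README does not show the actual regex used in the job.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def net_sales(raw):
    """Extract the numeric part of a currency string, or None if absent."""
    match = NUMBER.search(raw)
    return float(match.group()) if match else None


def average_net_sales(transactions, products):
    """Inner join on Product ID, then average NetSales per (category, ship mode)."""
    category_by_id = {p["Product ID"]: p["Product Category"] for p in products}
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for txn in transactions:
        category = category_by_id.get(txn["Product ID"])
        value = net_sales(txn["Sales"])
        if category is None or value is None:
            continue  # inner join drops unmatched rows; unparsable sales are skipped
        bucket = totals[(category, txn["Ship Mode"])]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample rows, for illustration only (not from the project's CSV files).
products = [
    {"Product ID": "P1", "Product Category": "Fashion"},
    {"Product ID": "P2", "Product Category": "Home"},
]
transactions = [
    {"Product ID": "P1", "Ship Mode": "Second Class", "Sales": "$100.00"},
    {"Product ID": "P1", "Ship Mode": "Second Class", "Sales": "$300.00"},
    {"Product ID": "P2", "Ship Mode": "First Class", "Sales": "$50.50"},
    {"Product ID": "P9", "Ship Mode": "First Class", "Sales": "$10.00"},  # no matching product
]

print(average_net_sales(transactions, products))
```

Note that the assumed pattern stops at a thousands separator, so a value such as `$1,274.70` would need the commas removed before extraction.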
Result Verification
- Queried the .parquet output file using S3 Select with SQL:
```
SELECT * FROM s3object
```
- Confirmed correct aggregation with output:
```
Fashion	Second Class	1274.702381
```
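
The same check could be run programmatically with boto3's `select_object_content`. This is a sketch under assumptions: the bucket and object key are hypothetical placeholders, and tab-delimited output is requested to match the result shown above.

```python
# Hypothetical bucket and key; the README does not give the output location.
BUCKET = "example-retail-output"
KEY = "example/part-00000.snappy.parquet"


def select_request(bucket, key):
    """Arguments for an S3 Select query over a Parquet object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": "SELECT * FROM s3object",
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"CSV": {"FieldDelimiter": "\t"}},
    }


def collect_rows(events):
    """Join the Records chunks of the event stream and split them into rows."""
    data = b"".join(e["Records"]["Payload"] for e in events if "Records" in e)
    return [line.split("\t") for line in data.decode("utf-8").splitlines() if line]


def run_query(bucket=BUCKET, key=KEY):
    """Run the query against S3. Needs boto3 and valid AWS credentials."""
    import boto3

    s3 = boto3.client("s3")
    response = s3.select_object_content(**select_request(bucket, key))
    return collect_rows(response["Payload"])
```

Record chunks are joined before splitting because a single row can arrive spread across more than one `Records` event.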

🖼️ Screenshots

You can find all key screenshots in the project submission, including:

| Step | Screenshot Description |
| --- | --- |
| Visual ETL Canvas | Full layout of the ETL job with connected nodes |
| Join Node Settings | Join keys selected on `Product ID` |
| Regex Extractor Settings | Regex used to clean `Sales` column |
| Aggregate Settings | Grouping and average aggregation on NetSales |
| S3 Select Query | Output query result confirming correct ETL flow |

📄 Files Included

- transactions.csv: Raw sales transaction data
- product details.csv: Product metadata
- Retail Data Management Project.pdf: Project prompt and instructions

📝 Project Purpose

This project was completed to demonstrate hands-on skills in AWS Glue ETL pipeline design, data cleaning using regex, aggregation logic, and S3-based querying for analytics use cases.

📜 License

MIT License — see LICENSE.

analyticshamza/Retail-Data-Management-with-AWS-Glue
