SQL_Layoffs_Data_Cleaning_Project

A practical MySQL data integrity project demonstrating best practices for cleaning and standardizing messy data. Transformed raw layoff data into an analysis-ready format.

Layoffs Data Cleaning and Standardization Project

🎯 Project Goal

To perform a systematic ETL (Extract, Transform, Load) process on a raw dataset of over 2,000 global company layoffs. The objective was to ensure the data is accurate, consistent, and ready for advanced exploratory data analysis (EDA).

🛠️ Technologies & Skills

Database: MySQL
Key SQL Concepts Demonstrated:
- Data Integrity: Use of staging tables to protect raw data.
- Advanced Deduplication: Employing the ROW_NUMBER() Window Function with partitioning.
- Data Imputation: Using Self-Joins (UPDATE ... JOIN) to intelligently fill missing data.
- Data Standardization: Cleaning text fields (TRIM, LIKE) for consistency.
- Data Typing: Converting TEXT dates to the DATE data type (STR_TO_DATE).

💡 Cleaning and Transformation Process

The project followed a multi-step data cleaning pipeline executed entirely in SQL:

1. Staging and Duplicate Identification

Staging Tables: Created two staging tables (layoffs_staging and layoffs_staging2) to work non-destructively, preserving the original data source.
Precise Deduplication: The ROW_NUMBER() function was applied, partitioned by all columns, to assign a unique rank to identical rows. All rows with a rank (row_num) greater than 1 were successfully deleted, ensuring a 100% unique dataset.

2. Handling Missing and Inconsistent Data

Imputing Missing Industry: Converted all blank industry fields to NULL for consistent querying. A Self-Join was then used to cross-reference the company name; if a company had a non-null industry record elsewhere, that value was used to fill its corresponding NULL entries (e.g., filling 'Travel' for 'Airbnb').
Standardization:
- Unified all variants of "Crypto" (e.g., 'Crypto Currency') into a single 'Crypto' term.
- Used the TRIM(TRAILING '.') function to remove inconsistent trailing periods from country names like 'United States.'.

3. Data Type Correction

Date Conversion: The date field, imported as TEXT in the format MM/DD/YYYY, was converted to the proper DATE format using the STR_TO_DATE() function, and the table schema was updated using ALTER TABLE.

4. Final Cleanup

Unusable records (where both total_laid_off and percentage_laid_off were NULL) were removed, as they provided no analytical value.
The temporary row_num column, used only for the deduplication process, was dropped.

📁 SQL Script

The complete, fully commented SQL data cleaning script can be found in the repository: View 01_Data_Cleaning.sql

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Data		Data
SQL		SQL
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SQL_Layoffs_Data_Cleaning_Project

Layoffs Data Cleaning and Standardization Project

🎯 Project Goal

🛠️ Technologies & Skills

💡 Cleaning and Transformation Process

1. Staging and Duplicate Identification

2. Handling Missing and Inconsistent Data

3. Data Type Correction

4. Final Cleanup

📁 SQL Script

About

Uh oh!

Releases

Packages

swarajgadgul/SQL_Layoffs_Data_Cleaning_Project

Folders and files

Latest commit

History

Repository files navigation

SQL_Layoffs_Data_Cleaning_Project

Layoffs Data Cleaning and Standardization Project

🎯 Project Goal

🛠️ Technologies & Skills

💡 Cleaning and Transformation Process

1. Staging and Duplicate Identification

2. Handling Missing and Inconsistent Data

3. Data Type Correction

4. Final Cleanup

📁 SQL Script

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages