Skip to content

A practical MySQL data integrity project demonstrating best practices for cleaning and standardizing messy data. Transformed raw layoff data into an analysis-ready format.

Notifications You must be signed in to change notification settings

swarajgadgul/SQL_Layoffs_Data_Cleaning_Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SQL_Layoffs_Data_Cleaning_Project

A practical MySQL data integrity project demonstrating best practices for cleaning and standardizing messy data. Transformed raw layoff data into an analysis-ready format.

Layoffs Data Cleaning and Standardization Project

🎯 Project Goal

To perform a systematic ETL (Extract, Transform, Load) process on a raw dataset of over 2,000 global company layoffs. The objective was to ensure the data is accurate, consistent, and ready for advanced exploratory data analysis (EDA).

πŸ› οΈ Technologies & Skills

  • Database: MySQL
  • Key SQL Concepts Demonstrated:
    • Data Integrity: Use of staging tables to protect raw data.
    • Advanced Deduplication: Employing the ROW_NUMBER() Window Function with partitioning.
    • Data Imputation: Using Self-Joins (UPDATE ... JOIN) to intelligently fill missing data.
    • Data Standardization: Cleaning text fields (TRIM, LIKE) for consistency.
    • Data Typing: Converting TEXT dates to the DATE data type (STR_TO_DATE).

πŸ’‘ Cleaning and Transformation Process

The project followed a multi-step data cleaning pipeline executed entirely in SQL:

1. Staging and Duplicate Identification

  • Staging Tables: Created two staging tables (layoffs_staging and layoffs_staging2) to work non-destructively, preserving the original data source.
  • Precise Deduplication: The ROW_NUMBER() function was applied, partitioned by all columns, to assign a unique rank to identical rows. All rows with a rank (row_num) greater than 1 were successfully deleted, ensuring a 100% unique dataset.

2. Handling Missing and Inconsistent Data

  • Imputing Missing Industry: Converted all blank industry fields to NULL for consistent querying. A Self-Join was then used to cross-reference the company name; if a company had a non-null industry record elsewhere, that value was used to fill its corresponding NULL entries (e.g., filling 'Travel' for 'Airbnb').
  • Standardization:
    • Unified all variants of "Crypto" (e.g., 'Crypto Currency') into a single 'Crypto' term.
    • Used the TRIM(TRAILING '.') function to remove inconsistent trailing periods from country names like 'United States.'.

3. Data Type Correction

  • Date Conversion: The date field, imported as TEXT in the format MM/DD/YYYY, was converted to the proper DATE format using the STR_TO_DATE() function, and the table schema was updated using ALTER TABLE.

4. Final Cleanup

  • Unusable records (where both total_laid_off and percentage_laid_off were NULL) were removed, as they provided no analytical value.
  • The temporary row_num column, used only for the deduplication process, was dropped.

πŸ“ SQL Script

The complete, fully commented SQL data cleaning script can be found in the repository: View 01_Data_Cleaning.sql

About

A practical MySQL data integrity project demonstrating best practices for cleaning and standardizing messy data. Transformed raw layoff data into an analysis-ready format.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published