A practical MySQL data integrity project demonstrating best practices for cleaning and standardizing messy data. Transformed raw layoff data into an analysis-ready format.
To perform a systematic ETL (Extract, Transform, Load) process on a raw dataset of over 2,000 global company layoffs. The objective was to ensure the data is accurate, consistent, and ready for advanced exploratory data analysis (EDA).
- Database: MySQL
- Key SQL Concepts Demonstrated:
- Data Integrity: Use of staging tables to protect raw data.
- Advanced Deduplication: Employing the
ROW_NUMBER()Window Function with partitioning. - Data Imputation: Using Self-Joins (
UPDATE ... JOIN) to intelligently fill missing data. - Data Standardization: Cleaning text fields (
TRIM,LIKE) for consistency. - Data Typing: Converting
TEXTdates to theDATEdata type (STR_TO_DATE).
The project followed a multi-step data cleaning pipeline executed entirely in SQL:
- Staging Tables: Created two staging tables (
layoffs_stagingandlayoffs_staging2) to work non-destructively, preserving the original data source. - Precise Deduplication: The
ROW_NUMBER()function was applied, partitioned by all columns, to assign a unique rank to identical rows. All rows with a rank (row_num) greater than 1 were successfully deleted, ensuring a 100% unique dataset.
- Imputing Missing Industry: Converted all blank
industryfields toNULLfor consistent querying. A Self-Join was then used to cross-reference thecompanyname; if a company had a non-null industry record elsewhere, that value was used to fill its correspondingNULLentries (e.g., filling 'Travel' for 'Airbnb'). - Standardization:
- Unified all variants of "Crypto" (e.g., 'Crypto Currency') into a single 'Crypto' term.
- Used the
TRIM(TRAILING '.')function to remove inconsistent trailing periods from country names like 'United States.'.
- Date Conversion: The date field, imported as
TEXTin the formatMM/DD/YYYY, was converted to the properDATEformat using theSTR_TO_DATE()function, and the table schema was updated usingALTER TABLE.
- Unusable records (where both
total_laid_offandpercentage_laid_offwereNULL) were removed, as they provided no analytical value. - The temporary
row_numcolumn, used only for the deduplication process, was dropped.
The complete, fully commented SQL data cleaning script can be found in the repository:
View 01_Data_Cleaning.sql