Project File: https://colab.research.google.com/drive/1zZhyFMuXrXDgkCO20zYk9zwvYontkQ7S?usp=sharing
Note: The dataset has been acquired from kaggle.com and has been uploaded to this repo.
This project implements a prediction model using Linear Regression to estimate an employee's salary.
The model uses multiple features to capture different aspects of an employee's profile:
- Experience
- Tenure
- Session Index
- Gender Index
- Location
- Tech Stack
Employe_Performance_dataset.csv → Dataset containing employee features and salary information
README.md → Project documentation and link to Jupyter Notebook with complete ML workflow (data preprocessing, model training, evaluation)
This project follows an end-to-end machine learning pipeline for regression tasks:
- Load the dataset (Employe_Performance_dataset.csv).
- Inspect data structure, features, and target (Salary).
- Handle missing values (imputation or removal).
- Remove duplicates and fix inconsistencies.
- Encode categorical variables (GenderIdx, Location, Tech Stack).
- Standardize/normalize numerical features (Experience, Tenure, Session_Idx).
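The preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the notebook's actual code: the rows are invented, the column names are taken from the lists above and may differ from the real CSV headers, and median imputation plus one-hot encoding are assumed choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows standing in for pd.read_csv("Employe_Performance_dataset.csv").
df = pd.DataFrame({
    "Experience": [1.0, 3.0, np.nan, 7.0, 7.0],
    "Tenure": [1.0, 2.0, 4.0, 5.0, 5.0],
    "Session_Idx": [0.2, 0.5, 0.4, 0.9, 0.9],
    "GenderIdx": [0, 1, 0, 1, 1],
    "Location": ["CityA", "CityB", "CityA", "CityC", "CityC"],
    "Tech Stack": ["Python", "Java", "Python", "Go", "Go"],
    "Salary": [30000, 45000, 52000, 80000, 80000],
})

# The last row repeats the fourth one, so it is dropped here.
df = df.drop_duplicates().reset_index(drop=True)

numeric_cols = ["Experience", "Tenure", "Session_Idx"]
categorical_cols = ["GenderIdx", "Location", "Tech Stack"]

preprocess = ColumnTransformer(
    [
        # Fill missing numeric values with the median, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # One-hot encode categorical columns; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df.drop(columns="Salary"))
y = df["Salary"].to_numpy()
print(X.shape)
```

Wrapping the steps in a ColumnTransformer keeps the imputation and scaling statistics tied to the data they were fitted on, which makes it easier to reuse the same transformation on new rows.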
- Visualize salary distribution.
- Study correlations between features and salary.
- Detect outliers in salary or experience data.
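The correlation and outlier checks above could look roughly like the following. The numbers are invented for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, assumed here rather than taken from the notebook.

```python
import pandas as pd

# Hypothetical sample; Salary is in thousands.
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8],
    "Salary": [30, 35, 41, 44, 52, 55, 61, 200],
})

# Pearson correlation of each feature with the target.
corr = df.corr()["Salary"].drop("Salary")
print(corr)

# Flag salaries outside 1.5 * IQR of the quartiles.
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Salary"] < q1 - 1.5 * iqr) | (df["Salary"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

For the distribution itself, a histogram of the Salary column (for example with seaborn's `histplot`) is a typical starting point.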
- Create derived features if beneficial (e.g., interaction terms).
- Scale numerical features for regression stability.
- Apply encoding methods (Label/One-Hot Encoding) for categorical data.
- Train-Test split (commonly 80:20).
- Shuffle the rows before splitting so that the test set is representative of the full dataset.
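A shuffled 80:20 split can be sketched with scikit-learn's `train_test_split`. The feature matrix below is random placeholder data, and the `random_state` value is an arbitrary choice made only to keep the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.2 gives the 80:20 split; shuffle=True randomizes row order first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
print(X_train.shape, X_test.shape)
```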
- Train Linear Regression on preprocessed features.
- Fit the model using Ordinary Least Squares (OLS).
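- Interpret regression coefficients to understand each feature's contribution (coefficient magnitudes are directly comparable only when the features are on the same scale).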
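As a sketch of the training step, the example below fits `LinearRegression` on synthetic data and cross-checks the result against a direct least-squares solve with NumPy. The intercept and coefficients used to generate the data are invented, and the features are assumed to be already standardized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic standardized features with known (hypothetical) effects on salary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_coef = np.array([5000.0, 1500.0])
y = 40000.0 + X @ true_coef + rng.normal(scale=100.0, size=200)

# scikit-learn's LinearRegression solves the ordinary least squares problem.
model = LinearRegression().fit(X, y)

# Cross-check: solve the same least-squares problem with an explicit intercept column.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```

Because both features share the same scale here, the larger coefficient indicates the feature with the stronger effect on the prediction.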
Evaluate predictions with regression performance metrics:
- Mean Absolute Error (MAE) – average absolute errors.
- Mean Squared Error (MSE) – penalizes larger errors.
- Root Mean Squared Error (RMSE) – interpretable in salary units.
- R² Score (Coefficient of Determination) – variance explained by the model.
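To make the four metrics concrete, here is a small worked example using scikit-learn's metric functions. The true and predicted salaries are invented values (in thousands) chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical salaries in thousands; errors are +2, -2, +3, -3.
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([52.0, 58.0, 73.0, 77.0])

mae = mean_absolute_error(y_true, y_pred)   # (2 + 2 + 3 + 3) / 4 = 2.5
mse = mean_squared_error(y_true, y_pred)    # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)                         # back in salary units
r2 = r2_score(y_true, y_pred)               # 1 - 26 / 500 = 0.948

print(mae, mse, rmse, r2)
```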
- Apply k-fold cross-validation to check robustness.
- Compare Linear Regression against alternative models (Ridge, Lasso, Decision Tree Regressor).
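The cross-validation and model comparison steps could be combined as in the sketch below. The data is synthetic and linear by construction, and the hyperparameters (`alpha` values, 5 folds) are assumptions chosen for illustration rather than tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a linear relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 40.0 + X @ np.array([5.0, 1.5, -2.0]) + rng.normal(scale=0.5, size=200)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
}

# Shuffled 5-fold cross-validation; the same folds are used for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Using one `KFold` object for all models keeps the comparison fair, since each model is scored on identical train and validation folds.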
- Save the trained model using joblib or pickle.
- Deploy as an API using Flask/Django.
- Build a simple frontend for interactive salary prediction.
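Saving and reloading the trained model with joblib might look like this. The file name `salary_model.joblib` and the toy training data are hypothetical; a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following salary = 20 + 10 * experience (in thousands).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([30.0, 40.0, 50.0, 60.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "salary_model.joblib"  # hypothetical file name
    joblib.dump(model, path)       # serialize the fitted model to disk
    restored = joblib.load(path)   # load it back, e.g. inside an API process

prediction = restored.predict(np.array([[5.0]]))
print(prediction)
```

A Flask or Django endpoint would typically load the model once at startup and call `predict` on each request.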
- Programming Language: Python
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
- Environment: Jupyter Notebook
- Successfully trained a Linear Regression model to predict employee salary.
- Evaluated performance with multiple regression metrics.
- Established a reliable baseline for future experimentation with advanced models.
- Add more advanced models (Random Forest, Gradient Boosting, XGBoost).
- Perform hyperparameter tuning for improved accuracy.
- Incorporate additional employee attributes for better prediction.
- Deploy as a full-stack application for HR analytics.
This project demonstrates an end-to-end regression pipeline, from raw data collection to model evaluation, for predicting employee salaries. It provides a solid foundation for building advanced salary prediction systems in real-world HR applications.