Glassdoor 2024 Data Science Job Salary Prediction

📌 Overview

This project aims to predict data science job salaries using multiple modeling approaches, leveraging 2024 Glassdoor job data. The application provides salary predictions through both traditional machine learning methods and cutting-edge large language model (LLM) techniques, offering users a comprehensive comparison of different predictive modeling paradigms.

The project targets data science professionals, recruiters, and researchers interested in salary analysis and prediction methodologies. It addresses the challenge of accurately estimating data science salaries by incorporating various job-related features and exploring the potential of LLMs in regression tasks. The interactive Streamlit application makes salary prediction accessible to both technical and non-technical users.

👉 Fine-tuned LLaMA 3.1 QLoRA model is available here: YuITC/llama31-8b-ins-qlora-sft

🧑‍💻 Tech Stack

Selenium, Pandas, NumPy, Scikit-learn, XGBoost, Optuna, PyTorch, Transformers, PEFT (QLoRA), SFT, OpenAI SDK, Hugging Face, Weights & Biases

Data Collection & Processing: Selenium for web scraping, Pandas and NumPy for data manipulation
Machine Learning: Scikit-learn for preprocessing, XGBoost for gradient boosting, Optuna for hyperparameter optimization
Deep Learning & LLMs: PyTorch for deep learning framework, Transformers and PEFT for model handling, QLoRA for efficient fine-tuning
LLM Services: OpenAI GPT-5 SDK for few-shot prompting, Hugging Face for model hosting and fine-tuning
Visualization: Matplotlib and Seaborn for data visualization and analysis
Deployment: Streamlit for interactive web application, Weights & Biases for experiment tracking
Development Tools: Jupyter Notebooks for experimentation, Python 3.10 for development environment

⭐ Key Features

Comprehensive Data Pipeline: Crawled 2024 Glassdoor Data Science jobs and performed comprehensive preprocessing, EDA, and feature engineering
High-Performance ML Model: Achieved R² = 0.82 using XGBoost, significantly outperforming the Linear Regression baseline (R² = 0.71)
Advanced Hyperparameter Optimization: Enhanced XGBoost predictive performance and robustness through advanced hyperparameter tuning with Optuna
LLM Integration: Experimented with LLM-based regression approaches via few-shot prompting with OpenAI's GPT-5 SDK and supervised fine-tuning (SFT + QLoRA) with LLaMA 3.1
Multi-Model Comparison: Benchmarked model performance across traditional ML, few-shot LLMs, and fine-tuned LLMs
Interactive Web Application: Built a salary prediction Streamlit app based on the three modeling approaches for easy accessibility

⚙️ Installation & Usage

Prerequisites

Python 3.10
GPU support for LLM inference (recommended)
Required API keys: OPENAI_API_KEY, HF_TOKEN, WANDB_API_KEY

Installation Steps

Clone the repository

git clone https://github.com/YuITC/2024-DataScience-Salaries-Analysis.git
cd 2024-DataScience-Salaries-Analysis

Create and activate virtual environment

python -m venv venv
venv\Scripts\activate  # On Windows
# source venv/bin/activate  # On macOS/Linux

Install dependencies
```
pip install -r requirements.txt
```

Set up environment variables Create a .env file in the root directory:

OPENAI_API_KEY=your_openai_api_key
HF_TOKEN=your_hugging_face_token
WANDB_API_KEY=your_wandb_api_key

Usage

Run the Streamlit application
```
streamlit run app.py
```
Explore Jupyter notebooks (optional)
```
jupyter notebook notebooks/
```
Run data crawling (optional)
```
jupyter notebook crawler.ipynb
```

📂 Project Structure

├── app.py                                    # Main Streamlit application
├── crawler.ipynb                             # Web scraping notebook for Glassdoor data
├── requirements.txt                          # Python dependencies
├── LICENSE                                   # Project license
├── assets/                                   # Assets folder
│   ├── demo1.png
│   └── demo2.png
├── data/
│   ├── glassdoor_jobs.csv                    # Raw scraped job data
│   ├── data_EDA.csv                          # Processed data for EDA
│   ├── data_model.csv                        # Final dataset for model training
│   └── data_sft/                             # Fine-tuning datasets
│       ├── train.json
│       ├── val.json
│       └── test.json
├── notebooks/
│   ├── 1-Preprocessing-and-Observation.ipynb # Preprocessing notebook
│   ├── 2-EDA-and-Feature-engineering.ipynb   # EDA notebook
│   ├── 3-Machine-Learning-approach.ipynb     # ML approach notebook
│   ├── 4-LLM-Few-shots-prompting.ipynb       # LLM few-shots prompting notebook
│   └── 5-SFT-QLoRA.ipynb                     # SFT + QLoRA notebook
└── outputs/                                  # Model outputs and results
    └── optuna_xgboost/
        └── xgb_optuna_tuning.pkl             # Trained XGBoost model

📫 Contact

If you find this project useful, consider ⭐️ starring the repository or contributing to further improvements!

For any questions, feature requests, or collaboration opportunities, feel free to reach out: tainguyenphu2502@gmail.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Glassdoor 2024 Data Science Job Salary Prediction

Table of Contents

📌 Overview

🧑‍💻 Tech Stack

⭐ Key Features

⚙️ Installation & Usage

Prerequisites

Installation Steps

Usage

📂 Project Structure

📫 Contact

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.streamlit		.streamlit
assets		assets
data		data
notebooks		notebooks
outputs/optuna_xgboost		outputs/optuna_xgboost
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.py		app.py
crawler.ipynb		crawler.ipynb
requirements.txt		requirements.txt

License

YuITC/2024-DataScience-Salaries-Analysis

Folders and files

Latest commit

History

Repository files navigation

Glassdoor 2024 Data Science Job Salary Prediction

Table of Contents

📌 Overview

🧑‍💻 Tech Stack

⭐ Key Features

⚙️ Installation & Usage

Prerequisites

Installation Steps

Usage

📂 Project Structure

📫 Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages