Skip to content

πŸ“Š Analyze real-world data on Data Science job salaries, benchmarking prediction performance using multiple approaches: traditional ML models, few-shot prompting, and fine-tuned LLMs.

License

Notifications You must be signed in to change notification settings

YuITC/2024-DataScience-Salaries-Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

15 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Glassdoor 2024 Data Science Job Salary Prediction

Table of Contents

πŸ“Œ Overview

Demo 1 Demo 2

This project aims to predict data science job salaries using multiple modeling approaches, leveraging 2024 Glassdoor job data. The application provides salary predictions through both traditional machine learning methods and cutting-edge large language model (LLM) techniques, offering users a comprehensive comparison of different predictive modeling paradigms.

The project targets data science professionals, recruiters, and researchers interested in salary analysis and prediction methodologies. It addresses the challenge of accurately estimating data science salaries by incorporating various job-related features and exploring the potential of LLMs in regression tasks. The interactive Streamlit application makes salary prediction accessible to both technical and non-technical users.

πŸ‘‰ Fine-tuned LLaMA 3.1 QLoRA model is available here: YuITC/llama31-8b-ins-qlora-sft

πŸ§‘β€πŸ’» Tech Stack

Selenium, Pandas, NumPy, Scikit-learn, XGBoost, Optuna, PyTorch, Transformers, PEFT (QLoRA), SFT, OpenAI SDK, Hugging Face, Weights & Biases

  • Data Collection & Processing: Selenium for web scraping, Pandas and NumPy for data manipulation
  • Machine Learning: Scikit-learn for preprocessing, XGBoost for gradient boosting, Optuna for hyperparameter optimization
  • Deep Learning & LLMs: PyTorch for deep learning framework, Transformers and PEFT for model handling, QLoRA for efficient fine-tuning
  • LLM Services: OpenAI GPT-5 SDK for few-shot prompting, Hugging Face for model hosting and fine-tuning
  • Visualization: Matplotlib and Seaborn for data visualization and analysis
  • Deployment: Streamlit for interactive web application, Weights & Biases for experiment tracking
  • Development Tools: Jupyter Notebooks for experimentation, Python 3.10 for development environment

⭐ Key Features

  • Comprehensive Data Pipeline: Crawled 2024 Glassdoor Data Science jobs and performed comprehensive preprocessing, EDA, and feature engineering
  • High-Performance ML Model: Achieved RΒ² = 0.82 using XGBoost, significantly outperforming the Linear Regression baseline (RΒ² = 0.71)
  • Advanced Hyperparameter Optimization: Enhanced XGBoost predictive performance and robustness through advanced hyperparameter tuning with Optuna
  • LLM Integration: Experimented with LLM-based regression approaches via few-shot prompting with OpenAI's GPT-5 SDK and supervised fine-tuning (SFT + QLoRA) with LLaMA 3.1
  • Multi-Model Comparison: Benchmarked model performance across traditional ML, few-shot LLMs, and fine-tuned LLMs
  • Interactive Web Application: Built a salary prediction Streamlit app based on the three modeling approaches for easy accessibility

βš™οΈ Installation & Usage

Prerequisites

  • Python 3.10
  • GPU support for LLM inference (recommended)
  • Required API keys: OPENAI_API_KEY, HF_TOKEN, WANDB_API_KEY

Installation Steps

  1. Clone the repository

    git clone https://github.com/YuITC/2024-DataScience-Salaries-Analysis.git
    cd 2024-DataScience-Salaries-Analysis
  2. Create and activate virtual environment

    python -m venv venv
    venv\Scripts\activate  # On Windows
    # source venv/bin/activate  # On macOS/Linux
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up environment variables Create a .env file in the root directory:

    OPENAI_API_KEY=your_openai_api_key
    HF_TOKEN=your_hugging_face_token
    WANDB_API_KEY=your_wandb_api_key

Usage

  1. Run the Streamlit application

    streamlit run app.py
  2. Explore Jupyter notebooks (optional)

    jupyter notebook notebooks/
  3. Run data crawling (optional)

    jupyter notebook crawler.ipynb

πŸ“‚ Project Structure

β”œβ”€β”€ app.py                                    # Main Streamlit application
β”œβ”€β”€ crawler.ipynb                             # Web scraping notebook for Glassdoor data
β”œβ”€β”€ requirements.txt                          # Python dependencies
β”œβ”€β”€ LICENSE                                   # Project license
β”œβ”€β”€ assets/                                   # Assets folder
β”‚   β”œβ”€β”€ demo1.png
β”‚   └── demo2.png
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ glassdoor_jobs.csv                    # Raw scraped job data
β”‚   β”œβ”€β”€ data_EDA.csv                          # Processed data for EDA
β”‚   β”œβ”€β”€ data_model.csv                        # Final dataset for model training
β”‚   └── data_sft/                             # Fine-tuning datasets
β”‚       β”œβ”€β”€ train.json
β”‚       β”œβ”€β”€ val.json
β”‚       └── test.json
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 1-Preprocessing-and-Observation.ipynb # Preprocessing notebook
β”‚   β”œβ”€β”€ 2-EDA-and-Feature-engineering.ipynb   # EDA notebook
β”‚   β”œβ”€β”€ 3-Machine-Learning-approach.ipynb     # ML approach notebook
β”‚   β”œβ”€β”€ 4-LLM-Few-shots-prompting.ipynb       # LLM few-shots prompting notebook
β”‚   └── 5-SFT-QLoRA.ipynb                     # SFT + QLoRA notebook
└── outputs/                                  # Model outputs and results
    └── optuna_xgboost/
        └── xgb_optuna_tuning.pkl             # Trained XGBoost model

πŸ“« Contact

If you find this project useful, consider ⭐️ starring the repository or contributing to further improvements!

For any questions, feature requests, or collaboration opportunities, feel free to reach out: tainguyenphu2502@gmail.com

About

πŸ“Š Analyze real-world data on Data Science job salaries, benchmarking prediction performance using multiple approaches: traditional ML models, few-shot prompting, and fine-tuned LLMs.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published