This project aims to predict customer churn using machine learning techniques. The primary goal is to build a predictive model that can determine whether a customer will churn (leave) based on their attributes.

codehass/Customer-Churn-Prediction

Customer Churn Prediction

This project aims to predict customer churn using machine learning techniques. The primary goal is to build a predictive model that can determine whether a customer will churn (leave) based on their attributes. The project trains several classification algorithms and evaluates their performance using accuracy, precision, recall, and F1-score.

Overview

Customer churn occurs when customers stop doing business with a company. Predicting churn lets a company direct its retention efforts at the customers most likely to leave. In this project, we use machine learning algorithms to predict whether a customer will churn based on features such as customer demographics, subscription details, and usage behavior.

The following models are implemented:

  • Logistic Regression
  • Random Forest Classifier
  • K-Nearest Neighbors (KNN)

The models are trained, evaluated, and compared based on their performance on a given dataset.

Installation

To get started with the project, follow the steps below:

  1. Clone the repository:

    git clone https://github.com/codehass/Customer-Churn-Prediction.git
  2. Navigate to the project directory:

    cd Customer-Churn-Prediction
  3. Install the required dependencies:

    You can install the required Python libraries via pip. Make sure you have Python 3.6+ installed.

    pip install -r requirements.txt

    Alternatively, you can manually install the dependencies:

    pip install pandas numpy scikit-learn matplotlib seaborn pytest

Data

The data used in this project comes from the Customer Churn Prediction dataset. The dataset contains customer information, such as demographics, account information, usage patterns, and whether the customer has churned.

The dataset is assumed to be in CSV format (data-68e11476082f9096032105.csv), with a column "Churn" indicating whether a customer has churned (Yes) or not (No).

Data Preprocessing

  1. Loading the Data: The raw data is loaded using pandas.
  2. Feature Engineering: The data is cleaned, missing values are handled, categorical variables are encoded, and numerical features are scaled.
  3. Train-Test Split: The data is split into training and testing datasets using train_test_split from sklearn.
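The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of pipeline.py: the tiny in-memory table stands in for the real CSV, every column name except "Churn" is hypothetical, and median imputation, one-hot encoding, and standard scaling are just one reasonable choice for each step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in for pd.read_csv(...); all columns except "Churn" are hypothetical.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "monthly_charges": [29.85, 56.95, 53.85, None, 99.65, 89.10, 29.75, 104.80],
    "contract": ["monthly", "yearly", "monthly", "yearly",
                 "monthly", "monthly", "yearly", "monthly"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes"],
})

numeric_cols = ["tenure", "monthly_charges"]
categorical_cols = ["contract"]

X = df[numeric_cols + categorical_cols]
y = (df["Churn"] == "Yes").astype(int)  # Yes -> 1, No -> 0

# Split first so that imputation and scaling statistics come from training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: one-hot encode, ignoring categories unseen in training.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```

Fitting the transformer on the training split only, then reusing it on the test split, keeps information from the test rows out of the preprocessing statistics.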

Modeling

Three machine learning models are used for churn prediction:

  1. Logistic Regression
  2. Random Forest Classifier
  3. K-Nearest Neighbors (KNN)
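As a rough illustration of how the three models might be trained side by side, the sketch below uses scikit-learn with a synthetic dataset from make_classification in place of the churn data. The hyperparameters shown (max_iter, n_estimators, n_neighbors) are assumptions for the example, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the preprocessed churn features.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters here are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

fitted = {}
for name, model in models.items():
    fitted[name] = model.fit(X_train, y_train)  # fit returns the estimator itself
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Keeping the models in one dictionary makes it straightforward to loop over them again later for evaluation and plotting.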

Model Training and Evaluation

Each model is trained on the processed dataset and evaluated based on:

  • Accuracy: Overall correctness of the model.
  • Precision: Correct positive predictions divided by all positive predictions.
  • Recall: Correct positive predictions divided by actual positives.
  • F1-Score: Harmonic mean of precision and recall.
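To make these definitions concrete, here is a small dependency-free sketch that computes all four metrics from true and predicted labels. The function name classification_metrics and the toy labels are hypothetical; in practice scikit-learn's metric functions would normally be used instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75.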

Precision-Recall (PR) curves are plotted to visualize the trade-off between precision and recall for each model.

Code Implementation

  • pipeline.py: Contains functions for data preprocessing and splitting the dataset.
  • eda_analysis.ipynb: Jupyter notebook for exploratory data analysis (EDA) and model experimentation.
  • tests/: Contains unit tests for various parts of the pipeline, ensuring that the data processing and modeling steps function correctly.

Evaluation

In addition to the metrics above, each model's performance is examined with:

  • Confusion Matrix
  • Precision-Recall Curve

The precision-recall curve is plotted for all three models, and the one with the best trade-off between precision and recall is selected as the final model for churn prediction.
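The README does not say how the "best trade-off" is measured. One common way to summarize a precision-recall curve in a single number is average precision (the step-wise area under the curve), so the sketch below uses that as an assumed selection rule. The model names and scores are made up for illustration.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct threshold, highest first."""
    total_pos = sum(y_true)  # assumes at least one positive label
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(y_true, scores):
    """Step-wise area under the PR curve: sum of precision times recall increase."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(y_true, scores):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Hypothetical predicted churn probabilities from two models on six customers.
y_true = [1, 0, 1, 0, 1, 0]
model_scores = {
    "model_a": [0.9, 0.2, 0.8, 0.4, 0.7, 0.1],
    "model_b": [0.9, 0.8, 0.3, 0.4, 0.7, 0.1],
}

ap_by_model = {name: average_precision(y_true, s) for name, s in model_scores.items()}
best = max(ap_by_model, key=ap_by_model.get)
print(ap_by_model, "-> selected:", best)
```

Here model_a ranks every churner above every non-churner, so its average precision is 1.0 and it is selected. Other selection rules, such as the best F1 at a chosen threshold, are equally plausible readings of the text.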

Running the Project

1. Run the Notebook

Open eda_analysis.ipynb and run the cells in sequence. The notebook trains the three models (Logistic Regression, Random Forest, and K-Nearest Neighbors) and will:

  • Load the dataset
  • Preprocess the data
  • Train each model
  • Evaluate each model's performance based on accuracy, precision, recall, and F1-score
  • Plot Precision-Recall curves for comparison

2. View Results

After running the training and evaluation steps, you will see evaluation outputs such as:

  • Confusion Matrix
  • Precision-Recall Curve

These outputs will help you compare the performance of the models and choose the best one.

Tests

This project includes unit tests to ensure the correctness of various parts of the code, including:

  • Data processing and splitting: Verify the consistency and correctness of train-test splits.
  • Model evaluation: Ensure that the models are evaluated correctly.

To run the tests:

    pytest

This will run all the tests in the tests/ directory.
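For a sense of what a split-consistency test could look like, here is a hedged sketch in pytest style. The helper split_data is a hypothetical stand-in, not the actual function in pipeline.py, and the checks shown are assumptions about what "consistency and correctness" might cover.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_data(X, y, test_size=0.2, seed=42):
    """Hypothetical stand-in for the project's splitting helper."""
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )


def test_split_sizes_and_reproducibility():
    X = np.arange(200).reshape(100, 2)
    y = np.array([0] * 80 + [1] * 20)  # 20% churners

    X_train, X_test, y_train, y_test = split_data(X, y)

    # An 80/20 split of 100 rows.
    assert len(X_train) == 80 and len(X_test) == 20
    # Stratification keeps the 20% churn rate in both parts.
    assert int(y_train.sum()) == 16 and int(y_test.sum()) == 4
    # The same seed yields the same split.
    X_train_again, *_ = split_data(X, y)
    assert np.array_equal(X_train, X_train_again)
```

pytest collects any function whose name starts with test_ from files named test_*.py, so a file like this placed under tests/ would be picked up by the command above.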

Contributing

Contributions are welcome! If you find a bug or want to improve the project, feel free to fork the repository and submit a pull request.

To contribute:

  1. Fork the repository
  2. Create a new branch
  3. Make your changes
  4. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.
