This project aims to predict customer churn using machine learning techniques. The primary goal is to build a predictive model that can determine whether a customer will churn (leave) based on their attributes. The project utilizes various classification algorithms and evaluates model performance using precision, recall, and accuracy metrics.
Customer churn refers to the phenomenon where customers stop doing business with a company. It is crucial for companies to predict churn, as this helps in customer retention strategies. In this project, we use machine learning algorithms to predict whether a customer will churn based on several features like customer demographics, subscription details, and usage behavior.
- Logistic Regression
- Random Forest Classifier
- K-Nearest Neighbors (KNN)
The models are trained, evaluated, and compared based on their performance on a given dataset.
To get started with the project, follow the steps below:
- Clone the repository: `git clone https://github.com/codehass/Customer-Churn-Prediction.git`
- Navigate to the project directory: `cd Customer-Churn-Prediction`
- Install the required dependencies. You can install the required Python libraries via `pip` (make sure you have Python 3.6+ installed): `pip install -r requirements.txt`
  Alternatively, you can install the dependencies manually: `pip install pandas numpy scikit-learn matplotlib seaborn pytest`
The data used in this project comes from the Customer Churn Prediction dataset. The dataset contains customer information, such as demographics, account information, usage patterns, and whether the customer has churned.
The dataset is assumed to be in CSV format (`data-68e11476082f9096032105.csv`), with a `Churn` column indicating whether a customer has churned (Yes) or not (No).
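For illustration, a minimal sketch of loading the raw CSV with `pandas` and checking the `Churn` label (the file name is taken from above; the actual loading logic lives in `pipeline.py`):

```python
import pandas as pd

# Load the raw churn dataset (file name as referenced above).
df = pd.read_csv("data-68e11476082f9096032105.csv")

# Quick sanity checks: shape, missing values, and class balance of the target.
print(df.shape)
print(df.isna().sum())
print(df["Churn"].value_counts())  # expected values: "Yes" / "No"
```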
- Loading the Data: The raw data is loaded using `pandas`.
- Feature Engineering: The features are processed to clean the data, handle missing values, encode categorical variables, and scale numerical features.
- Train-Test Split: The data is split into training and testing datasets using `train_test_split` from `sklearn`.
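The exact preprocessing is implemented in `pipeline.py`; the sketch below only illustrates these three steps with scikit-learn, using hypothetical column names (`tenure`, `monthly_charges`, `contract_type`) in place of the real features:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.read_csv("data-68e11476082f9096032105.csv")

# Map the target to 0/1 and separate it from the features.
y = df["Churn"].map({"Yes": 1, "No": 0})
X = df.drop(columns=["Churn"])

# Hypothetical column groups; the real pipeline derives these from the dataset.
numeric_cols = ["tenure", "monthly_charges"]
categorical_cols = ["contract_type"]

# Scale numerical features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Split into train/test sets, stratifying on the target to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train_processed = preprocess.fit_transform(X_train)
X_test_processed = preprocess.transform(X_test)
```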
Three machine learning models are used for churn prediction:
- Logistic Regression
- Random Forest Classifier
- K-Nearest Neighbors (KNN)
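As a rough sketch (the notebook may use different hyperparameters), the three classifiers can be instantiated and fitted on the preprocessed training data from the previous step like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the preprocessed training data.
for name, model in models.items():
    model.fit(X_train_processed, y_train)
```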
Each model is trained on the processed dataset and evaluated based on:
- Accuracy: Overall correctness of the model.
- Precision: Correct positive predictions divided by all positive predictions.
- Recall: Correct positive predictions divided by actual positives.
- F1-Score: Harmonic mean of precision and recall.
Precision-Recall (PR) curves are plotted to visualize the trade-off between precision and recall for each model.
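For illustration, the metrics and PR curves could be computed with scikit-learn as follows, assuming the `models` dictionary and test split from the sketches above (the notebook's actual plotting code may differ):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, PrecisionRecallDisplay
)

# Report accuracy, precision, recall, and F1 for each model on the test set.
for name, model in models.items():
    y_pred = model.predict(X_test_processed)
    print(
        f"{name}: "
        f"accuracy={accuracy_score(y_test, y_pred):.3f}, "
        f"precision={precision_score(y_test, y_pred):.3f}, "
        f"recall={recall_score(y_test, y_pred):.3f}, "
        f"f1={f1_score(y_test, y_pred):.3f}"
    )

# Overlay the Precision-Recall curves of all three models on one plot.
ax = plt.gca()
for name, model in models.items():
    PrecisionRecallDisplay.from_estimator(
        model, X_test_processed, y_test, name=name, ax=ax
    )
plt.show()
```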
- `pipeline.py`: Contains functions for data preprocessing and splitting the dataset.
- `eda_analysis.ipynb`: Jupyter notebook for exploratory data analysis (EDA) and model experimentation.
- `tests/`: Contains unit tests for various parts of the pipeline, ensuring that the data processing and modeling steps function correctly.
The model's performance is evaluated based on several metrics:
- Confusion Matrix
- Precision-Recall Curve
The precision-recall curve is plotted for all three models, and the one with the best trade-off between precision and recall is selected as the final model for churn prediction.
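As an illustrative sketch (not the notebook's exact code), a confusion matrix for one of the trained models could be produced like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Confusion matrix for one of the trained models (here: Random Forest).
y_pred = models["Random Forest"].predict(X_test_processed)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["No churn", "Churn"]).plot()
plt.show()
```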
1. Train the Models
Open `eda_analysis.ipynb` and run the cells in sequence. This notebook trains the three models: Logistic Regression, Random Forest, and K-Nearest Neighbors. The notebook will:
- Load the dataset
- Preprocess the data
- Train each model
- Evaluate each model's performance based on accuracy, precision, recall, and F1-score
- Plot Precision-Recall curves for comparison
2. View Results
After running the training and evaluation steps, you will see evaluation metrics such as:
- Confusion Matrix
- Precision-Recall Curve
These metrics will help you compare the performance of the models and choose the best one.
This project includes unit tests to ensure the correctness of various parts of the code, including:
- Data processing and splitting: Verify the consistency and correctness of train-test splits.
- Model evaluation: Ensure that the models are evaluated correctly.
To run the tests, execute `pytest` from the project root. This will run all the tests in the `tests/` directory.
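For reference, a hypothetical test checking split consistency might look like the following; the file and function names here are assumptions for illustration, not the repository's actual tests:

```python
# tests/test_pipeline.py -- hypothetical example of a split-consistency test.
import pandas as pd
from sklearn.model_selection import train_test_split


def test_train_test_split_is_consistent():
    # Small synthetic frame standing in for the churn dataset.
    df = pd.DataFrame({"feature": range(100), "Churn": ["Yes", "No"] * 50})
    X = df.drop(columns=["Churn"])
    y = df["Churn"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # The split should preserve all rows and keep features aligned with labels.
    assert len(X_train) + len(X_test) == len(df)
    assert list(X_train.index) == list(y_train.index)
    assert list(X_test.index) == list(y_test.index)
```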
Contributions are welcome! If you find a bug or want to improve the project, feel free to fork the repository and submit a pull request.
To contribute:
- Fork the repository
- Create a new branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.