This project aims to perform customer segmentation based on demographic and transaction data from a wholesale company. Using unsupervised clustering techniques, customers will be grouped into several segments based on shared characteristics. The goal is to better understand customer behavior, allowing the company to maximize the value of each customer and tailor more effective marketing strategies.
Understanding a diverse customer base is crucial for any business. Treating all customers the same can lead to ineffective marketing and missed opportunities. The challenge is to move beyond a one-size-fits-all approach and identify distinct groups within the customer data.
The goal of this project is to build a clustering model that answers the question: "What are the distinct customer archetypes in our data?" by using customer personality and transaction data. This will enable the business to create targeted campaigns and improve customer satisfaction.
The dataset used in this project is "Customer Personality Analysis", available on Kaggle.
- Data Manipulation & Analysis: `pandas`, `numpy`
- Data Visualization: `matplotlib`, `seaborn`, `plotly`
- Machine Learning: `scikit-learn`, `scikit-learn-extra`, `yellowbrick`
- Environment: Jupyter Notebook
- Clone the repository or download the project files.
- Create and activate a virtual environment (recommended):

      # Create the environment
      python -m venv .venv
      # Activate on Windows (PowerShell)
      .\.venv\Scripts\Activate.ps1
      # Activate on macOS/Linux
      source .venv/bin/activate

- Install the required dependencies from within the notebook: run the first few cells in the notebook to install `scikit-learn-extra` and `yellowbrick`.
- Launch Jupyter Notebook and open `notebook.ipynb`.
The analysis in this notebook follows these steps:
- Data Cleaning:
  - Handling missing values in the `Income` column by removing the corresponding rows.
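
A minimal sketch of the data cleaning step, assuming the data is loaded into a pandas DataFrame. The toy rows and the `ID` column are made up for illustration; only the `Income` column name comes from this README.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset (values are invented for illustration).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58138.0, np.nan, 71613.0, np.nan],
})

# Drop every row whose Income is missing, then renumber the index.
clean = df.dropna(subset=["Income"]).reset_index(drop=True)

print(f"rows before: {len(df)}, rows after: {len(clean)}")
```

Dropping rows is reasonable when only a small share of `Income` values is missing; imputation would be an alternative if many rows were affected.
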
- Feature Engineering:
  - Created new features like `Customer_For`, `Age`, `Spent`, `Living_With`, `Children`, `Family_Size`, and `Is_Parent` to better represent customer characteristics.
  - Simplified the `Education` and `Marital_Status` categories.
  - Removed redundant or irrelevant features.
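
One way the engineered features could be derived is sketched below. The raw column names (`Year_Birth`, `Kidhome`, `Teenhome`, `Dt_Customer`, `Mnt*`), the reference year, and the exact definitions are assumptions made for illustration; the notebook may compute them differently.

```python
import pandas as pd

# Toy rows; the raw column names are assumptions about the Kaggle file.
raw = pd.DataFrame({
    "Year_Birth": [1957, 1984, 1965],
    "Marital_Status": ["Single", "Married", "Together"],
    "Kidhome": [0, 1, 0],
    "Teenhome": [0, 1, 0],
    "Dt_Customer": ["2012-09-04", "2014-03-08", "2013-08-21"],
    "MntWines": [635, 11, 426],
    "MntMeatProducts": [546, 6, 127],
})

df = raw.copy()
REFERENCE_YEAR = 2021  # hypothetical reference year used to compute Age

# Days enrolled, measured relative to the most recent sign-up in the data.
joined = pd.to_datetime(df["Dt_Customer"])
df["Customer_For"] = (joined.max() - joined).dt.days

df["Age"] = REFERENCE_YEAR - df["Year_Birth"]

# Total spending: sum of every column whose name contains "Mnt".
df["Spent"] = df.filter(like="Mnt").sum(axis=1)

# Collapse marital status into two living situations (assumed mapping).
partnered = {"Married", "Together"}
df["Living_With"] = df["Marital_Status"].map(
    lambda status: "Partner" if status in partnered else "Alone"
)

df["Children"] = df["Kidhome"] + df["Teenhome"]
adults = df["Living_With"].map({"Alone": 1, "Partner": 2})
df["Family_Size"] = adults + df["Children"]
df["Is_Parent"] = (df["Children"] > 0).astype(int)

print(df[["Customer_For", "Age", "Spent", "Living_With",
          "Children", "Family_Size", "Is_Parent"]])
```
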
- Outlier Handling:
  - Removing outliers in the `Age` and `Income` features to improve model quality.
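
The outlier step can be sketched as a simple threshold filter. The cut-off values below are hypothetical; this README does not state which thresholds the notebook uses.

```python
import pandas as pd

# Toy data with one implausible age and one extreme income.
df = pd.DataFrame({
    "Age": [35, 52, 121, 44, 67],
    "Income": [42000, 61000, 58000, 666666, 75000],
})

# Hypothetical cut-offs chosen for illustration only.
MAX_AGE = 90
MAX_INCOME = 600_000

# Keep only rows that fall below both caps.
mask = (df["Age"] < MAX_AGE) & (df["Income"] < MAX_INCOME)
filtered = df[mask].reset_index(drop=True)

print(f"removed {len(df) - len(filtered)} outlier rows")
```
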
- Data Preprocessing:
  - Label Encoding: Converting categorical features (`Education`, `Living_With`) into a numerical format.
  - Standard Scaling: Scaling all numerical features to zero mean and unit variance so that they share a common scale.
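
The two preprocessing steps might look like the sketch below, which uses scikit-learn's `LabelEncoder` and `StandardScaler`. The category labels and numeric values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame; category labels and numbers are invented for illustration.
df = pd.DataFrame({
    "Education": ["Graduate", "Postgraduate", "Undergraduate", "Graduate"],
    "Living_With": ["Alone", "Partner", "Partner", "Alone"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
    "Spent": [1617, 27, 776, 53],
})

encoded = df.copy()
for col in ["Education", "Living_With"]:
    # LabelEncoder assigns integer codes in sorted order of the labels.
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# StandardScaler rescales each column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded),
    columns=encoded.columns,
)

print(scaled.round(2))
```

Standard scaling changes the location and spread of each feature but not the shape of its distribution, which matters because distance-based clustering would otherwise be dominated by large-valued columns such as `Income`.
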
- Dimensionality Reduction:
  - Using Principal Component Analysis (PCA) to reduce the data's dimensions to 3 principal components, which capture most of the data's variance.
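
A rough sketch of the reduction step with scikit-learn's `PCA`, run on synthetic data generated from three latent factors. The data, the seed, and the output column names are assumptions; the point is that `explained_variance_ratio_` shows how much variance the three components actually retain.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 8))

scaled = StandardScaler().fit_transform(X)

# Project onto the first 3 principal components.
pca = PCA(n_components=3)
reduced = pd.DataFrame(
    pca.fit_transform(scaled),
    columns=["pc1", "pc2", "pc3"],  # hypothetical column names
)

# Share of total variance retained by the 3 components.
explained = pca.explained_variance_ratio_.sum()
print(f"variance explained by 3 components: {explained:.1%}")
```
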
- Clustering Modeling:
  - Elbow Method: Using `KElbowVisualizer` with a K-Means model to determine the optimal number of clusters (`k`). The result shows that k=4 is the best number of clusters.
  - K-Medoids: Applying the K-Medoids algorithm to group the data into the 4 determined segments.
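
The sketch below illustrates both ideas on synthetic data. It is not the notebook's code: a plain K-Means inertia loop stands in for yellowbrick's `KElbowVisualizer`, and a small NumPy alternating k-medoids (the hypothetical helper `k_medoids`) stands in for the K-Medoids estimator from `scikit-learn-extra`, so that the example needs only NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-D points with 4 groups, standing in for the PCA output.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3,
                  cluster_std=0.6, random_state=0)

# Elbow method: within-cluster sum of squares (inertia) for k = 2..8.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(2, 9)
}
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")


def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating k-medoids: assign points, then re-pick each medoid."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the smallest total
                # distance to the other members of its cluster.
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels


medoids, labels = k_medoids(X, k=4)
print("medoid row indices:", medoids)
print("cluster sizes:", np.bincount(labels, minlength=4))
```

Unlike K-Means centroids, medoids are actual data points, so each segment can be described by a real representative customer.
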
- Visualization:
  - Creating a 3D scatter plot of the clustering results to see the distribution of each customer segment.
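
A 3D scatter plot of the segments could be produced along these lines. This sketch uses matplotlib with random placeholder points and labels; the axis names are assumptions, and the notebook may use a different plotting library.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: random 3-D points and random labels for 4 clusters.
rng = np.random.default_rng(0)
points = rng.normal(size=(120, 3))
labels = rng.integers(0, 4, size=120)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# Draw each cluster separately so it gets its own colour and legend entry.
for cluster in np.unique(labels):
    selected = labels == cluster
    ax.scatter(points[selected, 0], points[selected, 1], points[selected, 2],
               s=20, label=f"Cluster {cluster}")

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
ax.legend(title="Segment")

# Render to an in-memory PNG; in a notebook, plt.show() would be used instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=120)
```
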
This project is licensed under the MIT License.