This project applies Fuzzy C-Means (FCM) clustering (via `scikit-fuzzy`) to the Default of Credit Card Clients dataset using two features:
- `LIMIT_BAL`: credit limit
- `BILL TOTAL`: `BILL_AMT1` + … + `BILL_AMT6`
The pipeline:
- Load `data/credit_card_clients.csv` (your exact path).
- Create `BILL TOTAL`.
- Scale `['LIMIT_BAL', 'BILL TOTAL']` to `[0,1]`.
- Run FCM for a sweep of cluster counts (`c = 2..10`).
- Plot FPC vs `c` and a grid of mini-scatter plots.
- Pick the `c` with the highest FPC and plot the final clustering.
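
The first three steps (load, create `BILL TOTAL`, scale) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the rows in `SAMPLE_CSV` are made up and stand in for the real file, and the helper names `load_features` and `min_max_scale` are hypothetical.

```python
import csv
import io

# Made-up rows standing in for data/credit_card_clients.csv (illustration only).
SAMPLE_CSV = """LIMIT_BAL,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6
20000,3000,3100,700,0,0,0
120000,2600,1700,2600,3200,3400,3200
500000,60000,58000,61000,59000,62000,60000
"""

BILL_COLUMNS = [f"BILL_AMT{i}" for i in range(1, 7)]


def load_features(handle):
    """Return (LIMIT_BAL, BILL TOTAL) pairs read from an open CSV file."""
    rows = []
    for record in csv.DictReader(handle):
        bill_total = sum(float(record[col]) for col in BILL_COLUMNS)
        rows.append((float(record["LIMIT_BAL"]), bill_total))
    return rows


def min_max_scale(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    return [tuple(point) for point in zip(*scaled_columns)]


features = load_features(io.StringIO(SAMPLE_CSV))
scaled = min_max_scale(features)
```

With a real file, `load_features(open(path, newline=""))` would replace the `io.StringIO` call. Scaling matters here because the credit limit and the bill total live on very different ranges, and FCM's Euclidean distance would otherwise be dominated by the larger one.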
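FCM finds cluster centers $v_1,\dots,v_c$ and a membership matrix $U=[u_{ik}]\in\mathbb{R}^{c\times n}$, where $u_{ik}$ is the degree to which point $x_k$ belongs to cluster $i$.
- $u_{ik}\in[0,1]$, and $\sum_{i=1}^c u_{ik}=1$ for each $k$ (every column of $U$ sums to 1).
- Unlike K-Means’ hard labels, FCM tells you how much each point belongs to each cluster.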
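Minimize the fuzzy within-cluster SSE with fuzzifier $m>1$:

$$J_m(U,V)=\sum_{i=1}^{c}\sum_{k=1}^{n}u_{ik}^{m}\,\lVert x_k-v_i\rVert^{2}$$

Centers

$$v_i=\frac{\sum_{k=1}^{n}u_{ik}^{m}\,x_k}{\sum_{k=1}^{n}u_{ik}^{m}}$$

Memberships

$$u_{ik}=\frac{1}{\sum_{j=1}^{c}\left(\dfrac{\lVert x_k-v_i\rVert}{\lVert x_k-v_j\rVert}\right)^{2/(m-1)}}$$

Stopping

Stop when the change in memberships between iterations falls below a tolerance, $\lVert U^{(t+1)}-U^{(t)}\rVert<\varepsilon$, or when `maxiter` is reached.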
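
To make the alternating updates concrete, here is a small pure-Python sketch of the FCM loop. It is an illustration under stated assumptions, not the scikit-fuzzy implementation the project uses: the function names are hypothetical, memberships are stored as a list of `c` rows with one entry per point, the initial memberships are random, and the stopping test uses the largest absolute change in any membership rather than a matrix norm.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_centers(points, u, m):
    """Each center is the mean of the points weighted by u_ik ** m."""
    dims = len(points[0])
    centers = []
    for row in u:  # one row of memberships per cluster
        weights = [w ** m for w in row]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ))
    return centers


def update_memberships(points, centers, m):
    """Memberships proportional to (1 / squared distance) ** (1 / (m - 1))."""
    power = 1.0 / (m - 1.0)
    u = [[0.0] * len(points) for _ in centers]
    for k, p in enumerate(points):
        # Floor the distance so a point sitting on a center does not divide by zero.
        inv = [(1.0 / max(sq_dist(p, v), 1e-12)) ** power for v in centers]
        total = sum(inv)
        for i, value in enumerate(inv):
            u[i][k] = value / total
    return u


def fuzzy_c_means(points, c, m=2.0, error=1e-5, maxiter=300, seed=0):
    """Alternate the two updates until memberships settle or maxiter is hit."""
    rng = random.Random(seed)
    n = len(points)
    # Random initial memberships, normalised so every column sums to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        column_sum = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= column_sum
    for _ in range(maxiter):
        centers = update_centers(points, u, m)
        new_u = update_memberships(points, centers, m)
        change = max(abs(a - b)
                     for new_row, old_row in zip(new_u, u)
                     for a, b in zip(new_row, old_row))
        u = new_u
        if change < error:
            break
    return update_centers(points, u, m), u


# Two tight, well-separated toy groups (made-up points, already in [0, 1]).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centers, u = fuzzy_c_means(points, c=2)
```

The membership update works on squared distances, which is why the exponent is `1 / (m - 1)` instead of the `2 / (m - 1)` that appears when the formula is written with plain distances.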
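We report the Fuzzy Partition Coefficient (FPC) per $c$:

$$\mathrm{FPC}=\frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c}u_{ik}^{2}$$

- $\mathrm{FPC}\in[1/c,\,1]$, where $n$ is the number of points. Higher is better. $\mathrm{FPC}\approx 1/c$ means very fuzzy/overlapping partitions.
- In this 2-feature run, FPC peaks at $c=2$ and then decreases as $c$ grows.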
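
The FPC is simple enough to compute by hand from a membership matrix, which also shows where the $1/c$ floor comes from. The sketch below assumes memberships are stored as a list of `c` rows with one entry per point; the function name and the toy matrices are illustrative only. scikit-fuzzy's `cmeans` returns this same quantity as `fpc`, so in practice there is no need to recompute it.

```python
def fuzzy_partition_coefficient(u):
    """Mean over points of the sum of squared memberships (u is c rows by n columns)."""
    n = len(u[0])
    return sum(w * w for row in u for w in row) / n


# Toy membership matrices (made up for illustration).
crisp_2 = [[1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0]]   # every point fully in one cluster
uniform_2 = [[0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5]]  # every point split evenly: the degenerate case
fuzzy_3 = [[0.4, 0.3, 0.3],
           [0.3, 0.4, 0.3],
           [0.3, 0.3, 0.4]]

fpc_crisp = fuzzy_partition_coefficient(crisp_2)      # upper bound, 1
fpc_uniform = fuzzy_partition_coefficient(uniform_2)  # lower bound, 1/c

# Picking the cluster count with the highest FPC from a sweep.
fpc_by_c = {2: fpc_crisp, 3: fuzzy_partition_coefficient(fuzzy_3)}
best_c = max(fpc_by_c, key=fpc_by_c.get)
```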
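(Optionally, another index you may see is Xie–Beni: $\mathrm{XB}=\dfrac{\sum_{i=1}^{c}\sum_{k=1}^{n}u_{ik}^{m}\,\lVert x_k-v_i\rVert^{2}}{n\,\min_{i\neq j}\lVert v_i-v_j\rVert^{2}}$, with points $x_k$, centers $v_i$, and fuzzifier $m$; lower is better.)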
How to read:
- The curve shows FPC for `c = 2..10`.
- Pick the peak (here it’s at `c = 2`).
- The downward slope after `c=2` means adding more clusters makes the partition fuzzier (less crisp separation) for these two features.
What you’re seeing:
- Each panel is an FCM run for a specific `c`.
- Colors = hard labels from the fuzzy memberships (`argmax` across clusters).
- Black/red squares = cluster centers (in scaled space).
- As `c` increases, the algorithm keeps subdividing the dense region at low limit / low bill totals.
- FPC shown in each title steadily declines with `c`, indicating the split becomes less crisp.
Takeaway: For these two features, few clusters (especially `c=2`) summarize the structure best. Large `c` just slices the same mass in arbitrary ways.
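
The hard labels used for coloring are just an argmax down each column of the membership matrix. A minimal sketch, assuming memberships are stored as a list of `c` rows with one entry per point (the function name and the toy values are illustrative):

```python
def hard_labels(u):
    """For each point, the index of the cluster with the largest membership."""
    c, n = len(u), len(u[0])
    # Ties go to the lowest cluster index, because max() keeps the first maximum.
    return [max(range(c), key=lambda i: u[i][k]) for k in range(n)]


# Toy memberships for three points and two clusters (made up).
u_demo = [[0.9, 0.2, 0.5],
          [0.1, 0.8, 0.5]]
labels_demo = hard_labels(u_demo)
```

The third point is split evenly between the two clusters, yet it still receives a single color. That is the information a hard-label plot hides.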
Interpretation:
- Two broad groups appear in scaled space:
  - smaller limits & small bills;
  - higher limits & larger bills.
- The “X” markers are the fuzzy centers.
- Remember: points near the boundary have non-trivial memberships in both clusters; colors show the hard label for visualization only.
- Always scale features before FCM (Euclidean metric).
- If you switch to more features, avoid heavy collinearity (or use a compact subset).
- If you ever get FPC ≈ `1/c` across all `c`, that’s a degenerate run (uniform memberships). Rerun with different initializations or adjust features/scale.
- Soft memberships are great for: thresholding borderline points, ranking “how typical” a point is for a cluster, and flagging outliers (low max-membership).
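
The uses of soft memberships listed above can be sketched in a few lines. The helper names, the `0.6` threshold, and the toy matrix are assumptions chosen for illustration; a sensible threshold depends on `c` and on how crisp the partition is.

```python
def max_memberships(u):
    """Strongest membership of each point (u is c rows by n columns)."""
    return [max(row[k] for row in u) for k in range(len(u[0]))]


def borderline_points(u, threshold=0.6):
    """Indices of points whose strongest membership is below the threshold."""
    return [k for k, top in enumerate(max_memberships(u)) if top < threshold]


def rank_by_typicality(u, cluster):
    """Point indices ordered from most to least typical for one cluster."""
    memberships = u[cluster]
    return sorted(range(len(memberships)), key=lambda k: memberships[k], reverse=True)


# Toy memberships for four points and two clusters (made up).
u_demo = [[0.95, 0.55, 0.10, 0.40],
          [0.05, 0.45, 0.90, 0.60]]

flagged = borderline_points(u_demo)      # points with no clear home cluster
ranking = rank_by_typicality(u_demo, 0)  # most typical for cluster 0 first
```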