Unsupervised clustering and PCA of African-origin compounds using KMeans (k=4). The project identifies molecular feature patterns, visualizes clusters in reduced PCA space, and extracts a representative compound for each group.
This project performs unsupervised clustering on a curated dataset of African-origin chemical compounds to uncover structural patterns and feature-based groupings.
Using Principal Component Analysis (PCA) for dimensionality reduction and KMeans (k=4) for clustering, the notebook identifies the most representative compound from each cluster (closest to the centroid).
- Goal: Identify natural groupings in chemical compound data using unsupervised methods.
- Dataset: `African database.xlsx`, containing molecular descriptors such as `MolWt`, `LogP`, `HBA`, `HBD`, `TPSA`, and `RotBonds`.
- Techniques:
  - StandardScaler normalization
  - Principal Component Analysis (PCA)
  - KMeans clustering (`k=4`)
  - Representative compound selection (centroid + medoid options)
- Data Preprocessing
  - Load Excel dataset (`African database.xlsx`)
  - Clean missing entries and scale features using `StandardScaler`
- Dimensionality Reduction
  - PCA (2 principal components) for visualization
  - Explained variance ratio analyzed to gauge how much variance the two components retain
- Clustering
  - KMeans with `k=4`
  - Visualized clusters in PCA space
  - Cluster labels added to the dataframe
- Cluster Representation
  - Representative compound (closest to centroid)
  - Optionally, medoid compound (minimum average intra-cluster distance)
  - Results saved as CSV
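
The preprocessing, PCA, and clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses a synthetic descriptor table in place of `African database.xlsx`, assumes that "cleaning" means dropping rows with missing descriptors, assumes KMeans is fitted on the scaled features rather than on the PCA scores, and picks `n_init=10` and `random_state=42` arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

FEATURES = ["MolWt", "LogP", "HBA", "HBD", "TPSA", "RotBonds"]

# Synthetic stand-in for the descriptor table (assumption). With the real
# file this would be something like pd.read_excel("data/African database.xlsx").
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, len(FEATURES))), columns=FEATURES)
df.loc[df.index[::25], "LogP"] = np.nan  # simulate a few missing entries

# Preprocessing: drop rows with missing descriptors, then standardize.
clean = df.dropna(subset=FEATURES).reset_index(drop=True)
X = StandardScaler().fit_transform(clean[FEATURES])

# Dimensionality reduction: two components, used for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
retained = pca.explained_variance_ratio_.sum()

# Clustering: KMeans with k=4 on the scaled features (assumption).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clean["Cluster"] = kmeans.fit_predict(X)
clean["PC1"], clean["PC2"] = coords[:, 0], coords[:, 1]

print(f"Variance retained by PC1 + PC2: {retained:.2%}")
print(clean["Cluster"].value_counts().sort_index())
```

Swapping the synthetic frame for the real spreadsheet should leave the rest of the sketch unchanged, provided the column names match the descriptors listed in this README.
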
| Cluster | Representative SMILES | Distance to Centroid |
|---|---|---|
| 0 | C=C1CC/C=C(\C)CCC2=C(C)... | 22.07 |
| 1 | OC1=CC=CC=C1O | 9.91 |
| 2 | CC1(C)CC[C@]2(C(=O)O)... | 20.26 |
| 3 | OCC1O[C@@H](OC2=CC=... | 13.24 |
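
For reference, the two selection rules (closest to centroid, and medoid) might be implemented along these lines. The function names `closest_to_centroid` and `medoids` are hypothetical, the centroid is taken to be the per-cluster mean, and Euclidean distance is assumed. The README does not state which feature space the distances in the table were measured in, so the sketch works on whatever matrix is passed in.

```python
import numpy as np

def closest_to_centroid(X, labels):
    """Map each cluster to (row index, distance) of the point nearest its mean."""
    reps = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centroid = X[idx].mean(axis=0)
        dists = np.linalg.norm(X[idx] - centroid, axis=1)
        best = dists.argmin()
        reps[int(c)] = (int(idx[best]), float(dists[best]))
    return reps

def medoids(X, labels):
    """Map each cluster to the row index with the smallest mean distance to its members."""
    out = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        pts = X[idx]
        # Full pairwise distance matrix within the cluster.
        pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        out[int(c)] = int(idx[pairwise.mean(axis=1).argmin()])
    return out

# Toy one-dimensional data with two clusters (illustrative only).
X = np.array([[0.0], [1.0], [5.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(closest_to_centroid(X, labels))
print(medoids(X, labels))
```

The medoid computation builds a full pairwise distance matrix per cluster, which is fine for a few thousand compounds but grows quadratically with cluster size.
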
- PCA Scatter Plot — Compounds plotted by PC1 vs PC2, colored by cluster.
- Feature Distribution Charts — Boxplots per cluster.
- Cluster Centroid Distance Chart — Identifies representative compounds.
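
A PCA scatter plot of the kind listed above could be produced roughly like this. The coordinates and labels are random placeholders rather than the project's data, and the figure is written to a temporary directory instead of `figures/` so the sketch has no side effects on the repository.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(120, 2))      # placeholder PC1/PC2 scores
labels = rng.integers(0, 4, size=120)   # placeholder cluster labels

fig, ax = plt.subplots(figsize=(6, 5))
for c in range(4):
    mask = labels == c
    ax.scatter(coords[mask, 0], coords[mask, 1], s=18, label=f"Cluster {c}")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Compounds in PCA space, colored by cluster")
ax.legend()

# Save to a temporary location (assumption: the real notebook writes to figures/).
out_path = os.path.join(tempfile.mkdtemp(), "pca_clusters.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```
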
    african-compound-clustering/
    ├── README.md
    ├── requirements.txt
    ├── .gitignore
    ├── LICENSE
    ├── data/
    │   └── African database.xlsx
    ├── notebooks/
    │   └── unsupervisemodels.ipynb
    ├── results/
    │   ├── cluster_representatives.csv
    │   └── cluster_medoids.csv
    └── figures/
        ├── pca_clusters.png
        ├── feature_boxplots.png
        └── cluster_summary.png
- Python
- Pandas
- NumPy
- Scikit-Learn
- Matplotlib
- Seaborn
Akosu Michael Hemen
Data Scientist | Machine Learning Researcher
GitHub: mike3119
MIT License