African-compound-unsupervised-clustering-kmeans-pca

Unsupervised clustering and PCA analysis of African-origin compounds using KMeans (k=4). The project identifies molecular feature patterns, visualizes clusters in reduced PCA space, and extracts representative compounds for each group.

African Compound Clustering — PCA + KMeans (Unsupervised Learning)

This project performs unsupervised clustering on a curated dataset of African-origin chemical compounds to uncover structural patterns and feature-based groupings.

Using Principal Component Analysis (PCA) for dimensionality reduction and KMeans (k=4) for clustering, the notebook identifies the most representative compound from each cluster (closest to the centroid).

🧠 Project Summary

Goal: Identify natural groupings in chemical compound data using unsupervised methods.
Dataset: African database.xlsx — containing molecular descriptors such as MolWt, LogP, HBA, HBD, TPSA, and RotBonds.
Techniques:
- StandardScaler normalization
- Principal Component Analysis (PCA)
- KMeans clustering (k=4)
- Representative compound selection (centroid + medoid options)
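
The techniques listed above could be chained roughly as follows. This is a minimal sketch rather than the notebook's actual code: the data here is synthetic, the variable names are illustrative, and it assumes KMeans is fitted on the scaled descriptors rather than on the 2-D PCA projection, which the README does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Descriptor columns named in the dataset description above.
FEATURES = ["MolWt", "LogP", "HBA", "HBD", "TPSA", "RotBonds"]

# Synthetic stand-in for the real data (assumption); with the real file this
# would be something like pd.read_excel("data/African database.xlsx").
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, len(FEATURES))), columns=FEATURES)

# Drop incomplete rows, then scale each descriptor to zero mean, unit variance.
df = df.dropna(subset=FEATURES).reset_index(drop=True)
X_scaled = StandardScaler().fit_transform(df[FEATURES])

# Project onto two principal components for visualization.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cluster the scaled descriptors (assumption: not the 2-D projection).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X_scaled)
```

Fixing `random_state` keeps the cluster labels reproducible between runs; the value 0 is arbitrary.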

⚙️ Workflow

Data Preprocessing
- Load Excel dataset (African database.xlsx)
- Clean missing entries and scale features using StandardScaler
Dimensionality Reduction
- PCA (2 principal components) for visualization
- Explained variance ratio inspected to check how much of the total variance the two components retain
Clustering
- KMeans with k=4
- Visualized clusters in PCA space
- Cluster labels added to the dataframe
Cluster Representation
- Representative compound (closest to centroid)
- Optionally, medoid compound (minimum average intra-cluster distance)
- Results saved as CSV
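
The two selection rules in the last step can be sketched as below. The helper names `cluster_representatives` and `cluster_medoids` are hypothetical and the input is synthetic; the sketch assumes Euclidean distance in the same feature space used for clustering, which may differ from what the notebook does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances


def cluster_representatives(X, labels, centroids):
    """Map each cluster to (row index, distance) of the point nearest its centroid."""
    reps = {}
    for k, center in enumerate(centroids):
        idx = np.flatnonzero(labels == k)
        dists = np.linalg.norm(X[idx] - center, axis=1)
        best = dists.argmin()
        reps[k] = (int(idx[best]), float(dists[best]))
    return reps


def cluster_medoids(X, labels):
    """Map each cluster to the row index of the member with the smallest
    mean distance to the members of its own cluster."""
    medoids = {}
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        mean_dist = pairwise_distances(X[idx]).mean(axis=1)
        medoids[int(k)] = int(idx[mean_dist.argmin()])
    return medoids


# Synthetic stand-in for the scaled descriptor matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

reps = cluster_representatives(X, kmeans.labels_, kmeans.cluster_centers_)
medoids = cluster_medoids(X, kmeans.labels_)
```

The centroid rule picks the compound nearest the cluster mean, while the medoid rule only compares actual members with each other, so the two can disagree when a cluster is skewed. Either result could be written out with `pandas.DataFrame.to_csv`.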

🧪 Results Example

Cluster	Representative SMILES	Distance to Centroid
0	C=C1CC/C=C(\C)CCC2=C(C)...	22.07
1	OC1=CC=CC=C1O	9.91
2	CC1(C)CC[C@]2(C(=O)O)...	20.26
3	OCC1O[C@@H](OC2=CC=...	13.24

📊 Visualizations

PCA Scatter Plot — Compounds plotted by PC1 vs PC2, colored by cluster.
Feature Distribution Charts — Boxplots per cluster.
Cluster Centroid Distance Chart — Identifies representative compounds.
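
A PCA scatter plot of the kind described above might be produced along these lines. This is only a sketch on synthetic data, assuming Matplotlib alone; the colormap, figure size, and title are illustrative choices, not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled descriptor matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_pca = PCA(n_components=2).fit_transform(X)

# PC1 vs PC2, one color per cluster label.
fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="tab10", s=15)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Compounds in PCA space, colored by KMeans cluster")
ax.legend(*scatter.legend_elements(), title="Cluster")
```

Calling `fig.savefig(...)` afterwards would write the figure to disk, for example under `figures/`.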

📂 Repository Structure

african-compound-clustering/
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
├── data/
│   └── African database.xlsx
├── notebooks/
│   └── unsupervisemodels.ipynb
├── results/
│   ├── cluster_representatives.csv
│   └── cluster_medoids.csv
└── figures/
    ├── pca_clusters.png
    ├── feature_boxplots.png
    └── cluster_summary.png

🧩 Technologies Used

Python
Pandas
NumPy
Scikit-Learn
Matplotlib
Seaborn

👨🏽‍🔬 Author

Akosu Michael Hemen
Data Scientist | Machine Learning Researcher
GitHub: mike3119

📜 License

MIT License
