Unsupervised clustering and PCA analysis of African-origin compounds using KMeans (k=4). The project identifies molecular feature patterns, visualizes clusters in reduced PCA space, and extracts representative compounds for each group.

mike3119/African-compound-unsupervised-clustering-kmeans-pca

African-compound-unsupervised-clustering-kmeans-pca

African Compound Clustering — PCA + KMeans (Unsupervised Learning)

This project performs unsupervised clustering on a curated dataset of African-origin chemical compounds to uncover structural patterns and feature-based groupings.

Using Principal Component Analysis (PCA) for dimensionality reduction and KMeans (k=4) for clustering, the notebook identifies the most representative compound from each cluster (closest to the centroid).


🧠 Project Summary

  • Goal: Identify natural groupings in chemical compound data using unsupervised methods.
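  • Dataset: African database.xlsx, containing molecular descriptors such as MolWt (molecular weight), LogP, HBA (hydrogen-bond acceptors), HBD (hydrogen-bond donors), TPSA (topological polar surface area), and RotBonds (rotatable bonds).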
  • Techniques:
    • StandardScaler normalization
    • Principal Component Analysis (PCA)
    • KMeans clustering (k=4)
    • Representative compound selection (centroid + medoid options)
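
The scaling and PCA techniques listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `scale_and_project` is hypothetical, the column names are assumed to match the descriptor names given in this README, and the data is a synthetic placeholder rather than the real spreadsheet.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed column names, taken from the descriptor list in this README.
DESCRIPTORS = ["MolWt", "LogP", "HBA", "HBD", "TPSA", "RotBonds"]


def scale_and_project(df, n_components=2):
    """Drop rows with missing descriptors, standardize, then apply PCA."""
    clean = df.dropna(subset=DESCRIPTORS).reset_index(drop=True)
    scaled = StandardScaler().fit_transform(clean[DESCRIPTORS])
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(scaled)
    return clean, scaled, coords, pca


# Synthetic placeholder data, NOT the real dataset. With the real file this
# would presumably start from pd.read_excel("data/African database.xlsx").
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, len(DESCRIPTORS))), columns=DESCRIPTORS)
demo.loc[3, "LogP"] = np.nan  # simulate one missing entry

clean, scaled, coords, pca = scale_and_project(demo)
print(coords.shape)                   # (rows kept, number of components)
print(pca.explained_variance_ratio_)  # share of variance per component
```

Standardizing first matters here because descriptors such as MolWt and LogP live on very different scales, and PCA would otherwise be dominated by the largest-valued column.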

⚙️ Workflow

  1. Data Preprocessing

    • Load Excel dataset (African database.xlsx)
    • Clean missing entries and scale features using StandardScaler
  2. Dimensionality Reduction

    • PCA (2 principal components) for visualization
    • Explained variance ratio inspected to check how much of the total variance the two components retain
  3. Clustering

    • KMeans with k=4
    • Visualized clusters in PCA space
    • Cluster labels added to the dataframe
  4. Cluster Representation

    • Representative compound (closest to centroid)
    • Optionally, medoid compound (minimum average intra-cluster distance)
    • Results saved as CSV
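
The clustering and representative-selection steps above might look something like the sketch below. It is an assumption-laden illustration: the helper `cluster_representatives` and its return format are hypothetical, the input is synthetic, and whether the notebook clusters in scaled descriptor space or in PCA space is not stated here, so `X` simply stands for whichever feature matrix is used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances


def cluster_representatives(X, k=4, seed=0):
    """Fit KMeans and pick, per cluster, the point nearest the centroid and the medoid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    labels = km.labels_
    reps = {}
    for c in range(k):
        idx = np.flatnonzero(labels == c)  # row indices of this cluster's members
        members = X[idx]
        # Euclidean distance from each member to the fitted centroid.
        d_centroid = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        # Medoid: member with the smallest mean distance to the cluster's members.
        d_pair = pairwise_distances(members)
        reps[c] = {
            "nearest_to_centroid": int(idx[np.argmin(d_centroid)]),
            "distance": float(d_centroid.min()),
            "medoid": int(idx[np.argmin(d_pair.mean(axis=1))]),
        }
    return labels, reps


# Synthetic placeholder: four well-separated blobs, NOT the real compounds.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

labels, reps = cluster_representatives(X)
for c, info in reps.items():
    print(c, info)
```

The centroid is a mean and usually does not coincide with any real compound, which is why the nearest actual member is reported. The medoid is an alternative that depends only on pairwise distances between members, and the two choices can differ for skewed clusters.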

🧪 Results Example

| Cluster | Representative SMILES | Distance to Centroid |
|---|---|---|
| 0 | C=C1CC/C=C(\C)CCC2=C(C)... | 22.07 |
| 1 | OC1=CC=CC=C1O | 9.91 |
| 2 | CC1(C)CC[C@]2(C(=O)O)... | 20.26 |
| 3 | OCC1O[C@@H](OC2=CC=... | 13.24 |

📊 Visualizations

  • PCA Scatter Plot — Compounds plotted by PC1 vs PC2, colored by cluster.
  • Feature Distribution Charts — Boxplots per cluster.
  • Cluster Centroid Distance Chart — Identifies representative compounds.
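
As a rough idea of how the PCA scatter plot could be produced, here is a small Matplotlib sketch. The function name `plot_pca_clusters`, the styling, and the output path are assumptions, and the coordinates and labels are random placeholders rather than the project's results.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np


def plot_pca_clusters(coords, labels, path):
    """Scatter PC1 vs PC2, colored by cluster label, and save the figure to path."""
    fig, ax = plt.subplots(figsize=(6, 5))
    scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=20)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_title("Clusters in PCA space")
    # One legend entry per distinct cluster label.
    ax.legend(*scatter.legend_elements(), title="Cluster")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return path


# Random placeholder coordinates and labels, NOT the project's output.
rng = np.random.default_rng(0)
coords = rng.normal(size=(80, 2))
labels = rng.integers(0, 4, size=80)

out = plot_pca_clusters(coords, labels, os.path.join(tempfile.mkdtemp(), "pca_clusters.png"))
print(out)
```

In the repository layout, a figure like this would correspond to `figures/pca_clusters.png`.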

📂 Repository Structure

african-compound-clustering/
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
├── data/
│   └── African database.xlsx
├── notebooks/
│   └── unsupervisemodels.ipynb
├── results/
│   ├── cluster_representatives.csv
│   └── cluster_medoids.csv
└── figures/
    ├── pca_clusters.png
    ├── feature_boxplots.png
    └── cluster_summary.png



🧩 Technologies Used

  • Python
  • Pandas
  • NumPy
  • Scikit-Learn
  • Matplotlib
  • Seaborn

👨🏽‍🔬 Author

Akosu Michael Hemen
Data Scientist | Machine Learning Researcher
GitHub: mike3119


📜 License

MIT License
