Practical Data-Centric AI/ML for Biomedical Researchers

Overview

The module prioritizes practical, data-centric techniques so that researchers can immediately apply their data science and AI/ML knowledge to real-world problems. We aim to equip participants with the competencies and skills needed to make biomedical data FAIR (Findable, Accessible, Interoperable, and Reusable) and AI/ML-ready. The module also uses a blend of engaging instructional videos, interactive tutorials, and hands-on exercises to facilitate self-directed learning and knowledge retention.

Watch the Introduction video

Click above image to watch overview video

Background

The landscape of biomedical research is experiencing a fundamental shift, transitioning from hypothesis-driven approaches to data-driven discoveries fueled by the large and complex datasets generated through high-throughput technologies. Effectively analyzing and extracting meaningful insights from these datasets requires researchers to be proficient in advanced computational methods such as Artificial Intelligence (AI) and Machine Learning (ML). Furthermore, cloud computing offers flexible, cost-effective, and powerful solutions for data storage, analysis, and collaboration without placing the infrastructure burden on individual institutions. However, unlocking the full potential of cloud-based AI/ML in biomedical research hinges on equipping researchers with the necessary skills and knowledge. Recognizing this gap, the National Institute of General Medical Sciences (NIGMS) launched the NIGMS Sandbox initiative, which aims to create a repository of cloud-based learning modules for diverse biomedical data science topics. This module, "Practical Data-Centric AI/ML for Biomedical Researchers," aligns with the NIGMS vision of expanding the skilled workforce capable of harnessing the power of cloud computing and AI/ML. It tackles the challenge of upskilling biomedical researchers, equipping them with these skills to foster innovation and accelerate scientific discovery. By leveraging the NIGMS Sandbox and cloud platform, the module ensures broad accessibility. This democratizes access to cutting-edge knowledge, empowering researchers regardless of their institutional resources and fostering a more inclusive research landscape.

Software Requirements

These notebooks were designed to run on the AWS cloud computing platform and aim to require nothing but the files within this GitHub repository. The only setup needed is therefore to create a SageMaker AI notebook instance and download this repository's files to that machine.

For more information on creating a virtual machine and downloading our GitHub repository to that machine, see the Before Starting and Getting Started sections below. Currently these sections only cover how to do this using SageMaker AI on AWS.

Before Starting

Setting Up AWS

1. Setting Up an AWS SageMaker Notebook Instance

Log in to AWS Management Console:
- Navigate to the AWS SageMaker Console (find Amazon SageMaker AI under Services, or search for it in the search bar).
Create a SageMaker Notebook Instance:
- Click "Notebooks" in the left navigation pane (under Applications and IDEs, in Amazon SageMaker AI).
- Click "Create notebook instance" at the top of the Notebook instances.
- Fill out the following details:
  - Notebook instance name: Provide a unique name (e.g., notebook-yourname-date).
  - Instance type: Choose ml.t3.medium (or a larger instance type if your dataset is large; see more examples).
  - Click "Additional configuration"
    - Lifecycle Configuration (Optional): Add a script to install any additional dependencies automatically.
    - Volume size in GB: change it to 50
  - IAM Role: If you don’t have an existing role, create a new one with S3 full access and AmazonSageMakerFullAccess permissions.
Start the Notebook Instance:
- Click "Create notebook instance" and wait for the status to change to "InService".
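The console steps above can also be expressed programmatically. The following is a minimal sketch, assuming the boto3 SDK is installed and AWS credentials are already configured. The helper names `build_notebook_request` and `create_notebook`, the example instance name, and the role ARN are hypothetical placeholders and are not part of this repository.

```python
def build_notebook_request(name, role_arn, instance_type="ml.t3.medium", volume_gb=50):
    """Collect the settings from the console steps into one request dict."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        "VolumeSizeInGB": volume_gb,
    }


def create_notebook(request):
    """Send the request to SageMaker (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**request)


# Placeholder values: replace the name and role ARN with your own.
request = build_notebook_request(
    name="notebook-yourname-date",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
print(request)
```

Calling `create_notebook(request)` would submit the request. It is left uncalled here so the sketch can be read and run without touching an AWS account.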

You can also watch the AWS notebook setup video below for step-by-step instructions:

Click above image to watch notebook setup video

Note: For Submodule 5 Exercise 2, you will need to stop your instance, switch it to a GPU instance type (e.g., ml.g4dn.xlarge for CUDA support), and start it again.

Getting Started

Open a Terminal in Your SageMaker AI Notebook Instance: navigate to the Terminal tab in your SageMaker AI notebook instance.

Clone the Repository: use the git clone command to clone our GitHub repository:

```
cd SageMaker
git clone https://github.com/NIGMS/AI-ML-For-Biomedical-Researchers.git
```

Run the notebooks

There are five notebooks, one per submodule. From the Notebook Interface:
- Open the desired notebook file (e.g. Submodule_1.ipynb) from the Jupyter Notebook interface.
- Select the python3 kernel if one is not specified in the submodule notebook.
- Run the cells sequentially or selectively using the "Run" button or keyboard shortcut (usually Shift+Enter).

Note

If you encounter a kernel crash or a package installation failure, you can manually remove and recreate the environment.
- In a terminal, remove the broken environment:
```
conda env remove -n python3
```
- Recreate it by copying from the factory environment:
```
cp -r /home/ec2-user/anaconda3/envs/JupyterSystemEnv /home/ec2-user/anaconda3/envs/python3
```
- Restart the Jupyter kernel.

Notebook layout

Each notebook starts with video lectures on the topics, followed by quizzes to evaluate your understanding. Each notebook also has tutorials that help you learn how to implement the concepts and methods introduced in the lectures in Python code. We also provide exercises (with solutions) so you can practice and check your own work.

Architecture Design

Data

| File Name | Summary | Details |
| --- | --- | --- |
| messy_data.csv | The breast cancer dataset classifies each breast cancer patient as either a recurrence or no recurrence of cancer. 'messy_data.csv' is modified from the Breast Cancer Dataset so that various data cleaning techniques may be demonstrated. | Breast Cancer Dataset |
| pima-indians-diabetes.csv | The dataset classifies each patient as either having an onset of diabetes within five years or not. | Pima Indians Diabetes Dataset |
| hepatitis.data | The Hepatitis dataset is a medical dataset from the UCI Machine Learning Repository. It contains patient data related to hepatitis, which can be used for classification tasks such as predicting patient survival. | Hepatitis Dataset |
| wdbc.data | The Wisconsin Breast Cancer Dataset (WBCD) is a widely used dataset for breast cancer diagnosis. It contains features computed from digitized images of fine needle aspirates (FNA) of breast masses, which describe characteristics of cell nuclei. | Wisconsin Breast Cancer Dataset |
| messy_wine_data.csv | The wine dataset contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. 'messy_wine_data.csv' is modified from the 'Wine recognition dataset' by introducing some missing values. | UCI Wine Dataset |
| ecoli.csv | The E. coli dataset is for predicting protein localization sites in E. coli. | Ecoli Dataset |
| tox21.csv | The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds, which was used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways. | Toxicology Dataset |
| winequality-white.csv, winequality-red.csv | The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. | Wine Quality Dataset |
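Several of the files above are deliberately "messy" so that missing values can be found and handled. As a rough illustration only, the sketch below counts missing cells per column using the Python standard library. The inline sample, its column names, and the set of missing-value markers are all assumptions made for this example; they are not taken from the repository's files.

```python
import csv
import io

# Hypothetical sample in the spirit of a "messy" CSV; not taken from the repository.
SAMPLE = """age,tumor_size,recurrence
40-49,20-24,no
?,15-19,yes
50-59,,no
"""

# Assumed markers for a missing cell; adjust these for each dataset.
MISSING_MARKERS = {"", "?", "NA", "NaN"}


def count_missing(text):
    """Return a dict mapping each column name to its number of missing cells."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # DictReader fills short rows with None, so treat None as missing too.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE)
print(missing)
```

To try this on a real file, the same function can be given the text of a file read with `open(path).read()`, after checking which markers that file actually uses.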

Module Outline

Submodule 1 - Introduction

Learn AI/ML core concepts, diverse applications, introductory algorithms, ethical considerations, and data challenges.

Lecture
- Introduction to AI/ML
- AI/ML Basic Concepts
- AI/ML Applications
Tutorial
- Introduction to NumPy
- Introduction to Pandas
Exercise
- NumPy Exercise
- Pandas Exercise

Submodule 2 - Data Science Life Cycle, FAIR Data Principles, Data-Centric AI/ML, and Responsible AI/ML

Learn the data science life cycle and the FAIR principles for responsible data management. Learn to systematically engineer the data used to build an AI/ML system, and understand fairness, transparency, and accountability in AI/ML development and deployment.

Lecture
- Data Science Life Cycle
- FAIR Data Principles and FAIRness Metrics
- Data-Centric AI/ML
- Responsible AI/ML
Tutorial
- Data Centric AI/ML
- Responsible AI/ML
Exercise
- Data Centric AI/ML
- Responsible AI/ML

Submodule 3 - Data Preparation

Learn practical data cleaning techniques, as well as feature engineering, feature scaling, and feature selection techniques.

Lecture
- Data Collection and Data Preparation
- Feature Engineering, Scaling and Selection
Tutorial
- Data Cleaning
  - Basic Data Cleaning
  - Marking and Removal of Missing Data
  - Outlier Identification and Removal
  - Missing Data Imputation
- Feature Engineering
  - Encode Categorical Data
  - Change Numerical Data Distribution
  - Derive New Input Variables
- Feature Scaling
  - Numerical Data
  - Data With Outliers
- Feature Selection
  - Numerical Input Features
  - Categorical Input Features
  - Recursive Feature Elimination
Exercise
- Data Wrangling Exercise
- Feature Engineering Exercise
- Feature Scaling Exercise
- Feature Selection Exercise
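
To give a flavor of the data preparation topics listed above, here is a minimal sketch of three of them: median imputation, outlier identification with the interquartile range (IQR) rule, and min-max scaling. It uses only the Python standard library. The function names and the sample values are hypothetical, and the notebooks themselves may use different libraries and methods.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


raw = [5.1, 4.9, None, 5.0, 5.2, 4.8, 25.0]  # hypothetical measurements
filled = impute_median(raw)                  # fill the missing entry
outliers = iqr_outliers(filled)              # flag extreme values
cleaned = [v for v in filled if v not in outliers]
scaled = min_max_scale(cleaned)              # scale what remains
print(outliers, scaled)
```

The order used here (impute, then remove outliers, then scale) is one reasonable choice for a sketch, not a rule; in practice the order depends on the dataset and the model.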

Submodule 4 - Model Building, Evaluation, Interpretation, and Deployment

Explore different AI/ML models and model evaluation techniques, delve into interpretability methods, and learn best practices for model deployment.

Lecture
- AI/ML Models and Model Evaluation
- Model Tuning, Interpretation and Deployment
Tutorial
- Model Building and Evaluation
- Model Tuning, Model Interpretation, Model Deployment
- Predict Drug Activity for Androgen Receptor
Exercise
- Exploratory Analysis of Diabetes Risk Factors
- Predicting Diabetes Risk
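
As a small illustration of model evaluation for a binary classifier such as a diabetes risk model, the sketch below computes confusion-matrix counts and derives accuracy, precision, recall, and F1 from them in plain Python. The labels and predictions are hypothetical, and this is not the exercise solution.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall, and F1, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = onset of diabetes) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, tn, fn, accuracy, precision, recall, f1)
```

Reporting precision and recall alongside accuracy matters for medical data, where classes are often imbalanced and a missed positive can cost more than a false alarm.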

Submodule 5 - AI/ML for Biomedical Applications

Show different types of AI/ML algorithms and their suitability for biomedical data. Explore real-world examples of AI/ML in various areas of biomedicine.

Lecture
- AI/ML Applications in Biomedicine
- Introduction to Deep Learning
Tutorial
- Pfam protein sequence classification using TensorFlow and Keras
- Predicting the Solubility of Small Molecules
Exercise
- Predicting Diabetes Risk with a Deep Neural Network
- Protein 3D Structure Prediction using LocalColabFold

Funding

The creation of this training module was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number 3T32GM142603-03S1. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

License for Data

Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Contributors

The content of this training module was contributed by Chuming Chen, Cecilia Arighi, Ryan Moore, Alexa Bennett, Amelia Harrison, and Shawn Polson.

Acknowledgement

We would like to thank Sylvia Kinya, Aaron Onserio, and the Deloitte team for reviewing and testing our module.

