Synthea Data Analysis

This repository contains a series of Python scripts and Jupyter notebooks for cleaning, processing, and analysing synthetic healthcare data generated by the Synthea simulator, with a focus on hypertension analysis. The project includes data cleaning, data validation, and statistical analysis related to blood pressure, BMI, and hypertension prevalence.

Repository Structure

The project is organised as follows:

├── README.md                         # Project overview, setup, & usage
├── synthea_data-analysis.ipynb       # Integrated notebook
├── requirements.txt                  # Python dependencies
├── .gitignore                        # Ignoring data dumps, etc.
├── data/
│   ├── original/                     # Raw Synthea data (input data)
│   └── processed/                    # Cleaned outputs from scripts
├── docs/
│   └── data_dictionary.md            # Data dictionary for reference
├── archive/                          # Archived scripts and notebooks
│   ├── scripts/                      # Python scripts
│   │   ├── 01_patient_cleaning.py
│   │   ├── 02_conditions_cleaning.py
│   │   ├── 03_observations_cleaning.py
│   │   ├── 04_medications_cleaning.py
│   │   ├── 05_encounters_cleaning.py
│   │   ├── 06_data_desc.py
│   │   ├── 07_hypertension_bp_bmi_analysis.py
│   │   ├── 08_compare_bp_bmi_hypertensive_vs_non.py
│   │   └── 09_hypertension_prevalence.py
│   └── notebooks/                    # Jupyter notebooks

Project Overview

This repository focuses on cleaning and analysing the synthetic healthcare data produced by the Synthea simulator. The analysis primarily examines hypertension-related data, including blood pressure and BMI metrics.

Analysis Pipeline

1. Data Cleaning: The raw Synthea data is cleaned in a series of scripts, starting with patient data and continuing through conditions, observations, medications, and encounters.
2. Data Analysis: Once the data is cleaned, the project performs statistical analysis on key indicators like hypertension prevalence, blood pressure (BP), and BMI across different patient populations.
3. Reporting & Visualisation: The final results are summarised in reports, including figures and tables generated during analysis.
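
As an illustration of the analysis step, here is a minimal sketch of how hypertension prevalence could be computed with pandas. It assumes Synthea's default CSV column names (Id in the patients table, PATIENT and DESCRIPTION in the conditions table) and flags hypertension with a case-insensitive text match on DESCRIPTION. The function name and the sample rows are hypothetical, and the project's own scripts may use different columns, codes, or definitions.

```python
import pandas as pd


def hypertension_prevalence(patients: pd.DataFrame, conditions: pd.DataFrame) -> float:
    """Share of patients with at least one hypertension condition record."""
    # Assumption: a text match on DESCRIPTION is good enough for a sketch.
    # It would also match related terms such as "pulmonary hypertension".
    is_htn = conditions["DESCRIPTION"].str.contains("hypertension", case=False, na=False)
    htn_patients = set(conditions.loc[is_htn, "PATIENT"])
    if len(patients) == 0:
        return float("nan")
    return float(patients["Id"].isin(htn_patients).mean())


# Made-up rows for illustration only (not real Synthea output).
patients = pd.DataFrame({"Id": ["p1", "p2", "p3", "p4"]})
conditions = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p3", "p2"],
        "DESCRIPTION": [
            "Hypertension",
            "Prediabetes",
            "Essential hypertension (disorder)",
            "Viral sinusitis (disorder)",
        ],
    }
)

prevalence = hypertension_prevalence(patients, conditions)
print(f"Hypertension prevalence: {prevalence:.1%}")
```

Counting each patient once, however many condition records they have, keeps the result a proportion of the patient population.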

Install

To get started, you can set up the environment using pip. First, clone the repository:

git clone https://github.com/babak2/synthea_data-analysis.git
cd synthea_data-analysis

Then, install the required dependencies:

pip install -r requirements.txt

Required Libraries

The project requires the following key Python libraries:

pandas: For data manipulation and cleaning
numpy: For numerical operations
matplotlib and seaborn: For data visualisation
jupytext: To work seamlessly with Jupyter notebooks and scripts

For a full list of dependencies, check out the requirements.txt file.
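
After installing, a quick check along the following lines can confirm that the key libraries are importable. This is a small sketch using only the standard library. The package list mirrors the names above and the helper name is hypothetical; requirements.txt remains the authoritative list.

```python
from importlib.util import find_spec

# Key libraries named in this README; requirements.txt holds the full list.
KEY_PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "jupytext"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(KEY_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All key packages are importable.")
```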

Running the Scripts

The repository contains Python scripts that can be run one at a time or all together in sequence. Here's how you can run them:

Run individual Python scripts: The scripts are numbered in the order they are designed to run. You can run any script individually using Python:

python archive/scripts/01_patient_cleaning.py
python archive/scripts/02_conditions_cleaning.py
...

and so on for each script.

Execute the integrated Jupyter notebook: The final analysis is contained in the synthea_data-analysis.ipynb notebook. Open it with the command below, then run all cells to execute the entire analysis in one go:

jupyter notebook synthea_data-analysis.ipynb
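
To run every numbered script in order without typing each command, a small runner along these lines could be used. This is a sketch and not part of the repository: the function names are hypothetical, and it assumes the scripts live in archive/scripts/ and keep their two-digit numeric prefixes.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(script_dir: Path) -> list[Path]:
    """Return scripts named like 01_*.py, 02_*.py, ... sorted by file name."""
    return sorted(script_dir.glob("[0-9][0-9]_*.py"))


def run_all(script_dir: Path) -> None:
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in ordered_scripts(script_dir):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([sys.executable, str(script)], check=True)
```

Calling run_all(Path("archive/scripts")) from the repository root would run 01 through 09 in sequence. Stopping at the first failure avoids running later scripts against incomplete output.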

Data

Place the raw Synthea data files in the data/original/ directory. After you run the cleaning scripts, the processed data is saved in the data/processed/ directory. Here's an example of the data structure:

data/
├── original/   # Raw data
│   ├── patients.csv.gz
│   ├── conditions.csv.gz
│   ├── observations.csv.gz
│   └── ...
└── processed/  # Cleaned data
    ├── clean_patients.csv
    ├── clean_conditions.csv
    ├── clean_observations.csv
    └── ...
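
The mapping from a raw file to its cleaned counterpart can be sketched as below. The helper names are hypothetical, and dropping duplicate rows is only a placeholder for whatever cleaning the actual scripts perform. The naming pattern (patients.csv.gz to clean_patients.csv) follows the layout shown above.

```python
from pathlib import Path

import pandas as pd


def processed_path(raw_file: Path, processed_dir: Path) -> Path:
    """Map e.g. patients.csv.gz to <processed_dir>/clean_patients.csv."""
    table = raw_file.name.split(".")[0]
    return processed_dir / f"clean_{table}.csv"


def clean_table(raw_file: Path, processed_dir: Path) -> Path:
    """Read one raw table, apply a placeholder cleaning step, and save it."""
    # pandas infers gzip compression from the .gz extension.
    df = pd.read_csv(raw_file)
    # Placeholder cleaning step (an assumption): drop exact duplicate rows.
    df = df.drop_duplicates()
    out = processed_path(raw_file, processed_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out, index=False)
    return out


# Example call, assuming the raw file exists:
# clean_table(Path("data/original/patients.csv.gz"), Path("data/processed"))
```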

Contributing

If you'd like to improve the analysis, suggest new features, or fix bugs, feel free to fork the repository and create a pull request.

How to Contribute

1. Fork the repository.
2. Create a feature branch (git checkout -b feature-branch).
3. Commit your changes (git commit -am 'Add new feature').
4. Push to the branch (git push origin feature-branch).
5. Create a new Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Babak Mahdavi Ardestani

babak.m.ardestani@gmail.com
