babak2/synthea-data-analysis

Synthea Data Analysis

This repository contains a series of Python scripts and Jupyter notebooks for cleaning, processing, and analysing synthetic healthcare data generated by the Synthea simulator, with a focus on hypertension analysis. The project includes data cleaning, data validation, and statistical analysis related to blood pressure, BMI, and hypertension prevalence.

Repository Structure

The project is organised as follows:

├── README.md                         # Project overview, setup, & usage
├── synthea_data-analysis.ipynb       # Integrated notebook
├── requirements.txt                  # Python dependencies
├── .gitignore                        # Ignoring data dumps, etc.
├── data/
│   ├── original/                     # Raw Synthea data (input data)
│   └── processed/                    # Cleaned outputs from scripts
├── docs/
│   └── data_dictionary.md            # Data dictionary for reference
├── archive/                          # Archived scripts and notebooks
│   ├── scripts/                      # Python scripts
│   │   ├── 01_patient_cleaning.py
│   │   ├── 02_conditions_cleaning.py
│   │   ├── 03_observations_cleaning.py
│   │   ├── 04_medications_cleaning.py
│   │   ├── 05_encounters_cleaning.py
│   │   ├── 06_data_desc.py
│   │   ├── 07_hypertension_bp_bmi_analysis.py
│   │   ├── 08_compare_bp_bmi_hypertensive_vs_non.py
│   │   └── 09_hypertension_prevalence.py
│   └── notebooks/                    # Jupyter notebooks

Project Overview

This repository focuses on cleaning and analysing the synthetic healthcare data produced by the Synthea simulator. The analysis primarily examines hypertension-related data, including blood pressure and BMI metrics.

Analysis Pipeline

  1. Data Cleaning:
    The raw Synthea data is cleaned in a series of scripts, starting with patient data and continuing through conditions, observations, medications, and encounters.

  2. Data Analysis:
    Once the data is cleaned, the project performs statistical analysis on key indicators like hypertension prevalence, blood pressure (BP), and BMI across different patient populations.

  3. Reporting & Visualisation:
    The final results are summarised in reports, including figures and tables generated during analysis.
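As a rough illustration of the cleaning and analysis stages above, the sketch below deduplicates a small patient table, flags patients with a hypertension diagnosis, and computes prevalence by gender. The column names (`Id`, `GENDER`, `PATIENT`, `DESCRIPTION`) are assumed to follow Synthea's CSV export, the inline rows are toy data, and the helper `hypertension_prevalence` is hypothetical; the archived scripts may work differently.

```python
import pandas as pd

# Toy stand-ins for the patients and conditions tables (illustrative values only).
patients = pd.DataFrame(
    {
        "Id": ["p1", "p2", "p2", "p3", "p4"],  # "p2" is duplicated on purpose
        "GENDER": ["F", "M", "M", "F", "M"],
    }
)
conditions = pd.DataFrame(
    {
        "PATIENT": ["p1", "p2", "p4"],
        "DESCRIPTION": ["Hypertension", "Prediabetes", "Essential hypertension"],
    }
)


def hypertension_prevalence(patients: pd.DataFrame, conditions: pd.DataFrame) -> pd.Series:
    """Return the share of patients with a hypertension condition, by gender."""
    # Cleaning: keep one row per patient identifier.
    clean = patients.drop_duplicates(subset="Id").copy()

    # Flag patients with at least one condition whose description mentions hypertension.
    is_htn = conditions["DESCRIPTION"].str.contains("hypertension", case=False, na=False)
    htn_ids = set(conditions.loc[is_htn, "PATIENT"])
    clean["hypertensive"] = clean["Id"].isin(htn_ids)

    # The mean of a boolean column is the proportion of True values.
    return clean.groupby("GENDER")["hypertensive"].mean()


prevalence = hypertension_prevalence(patients, conditions)
print(prevalence)
```

Matching on the condition description is only one way to identify hypertensive patients; filtering on a diagnosis code would be an alternative if the cleaned data keeps one.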

Install

To get started, you can set up the environment using pip. First, clone the repository:

git clone https://github.com/babak2/synthea-data-analysis.git
cd synthea-data-analysis

Then, install the required dependencies:

pip install -r requirements.txt

Required Libraries

The project requires the following key Python libraries:

  • pandas: For data manipulation and cleaning

  • numpy: For numerical operations

  • matplotlib and seaborn: For data visualisation

  • jupytext: To work seamlessly with Jupyter notebooks and scripts

For a full list of dependencies, check out the requirements.txt file.

Running the Scripts

The repository contains Python scripts that can be run one at a time or together in sequence. Here's how you can run them:

  1. Run individual Python scripts: The scripts are numbered in the order they are meant to run. Run them one at a time with Python, starting from 01:

    python archive/scripts/01_patient_cleaning.py
    python archive/scripts/02_conditions_cleaning.py
    # ... and so on for each script

  2. Execute the integrated Jupyter notebook: The final analysis is contained in the synthea_data-analysis.ipynb notebook. Open it with the command below, then run all cells to execute the entire analysis in one go:

    jupyter notebook synthea_data-analysis.ipynb
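Running the numbered scripts in order can also be automated. The following is a minimal sketch, not part of the repository: the helpers `ordered_scripts` and `run_pipeline` are hypothetical, and it assumes every script can be started from the repository root without arguments.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(scripts_dir: Path) -> list[Path]:
    """Return the numbered scripts sorted by file name (01_, 02_, ...)."""
    return sorted(scripts_dir.glob("[0-9][0-9]_*.py"))


def run_pipeline(scripts_dir: Path) -> None:
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in ordered_scripts(scripts_dir):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a script exits with a non-zero status.
        subprocess.run([sys.executable, str(script)], check=True)
```

Calling `run_pipeline(Path("archive/scripts"))` from the repository root would then run 01 through 09 in order. Stopping at the first failure avoids running an analysis script against missing or stale cleaned data.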

Data

Place the raw Synthea data files in the data/original/ directory. The cleaning scripts save the processed data to the data/processed/ directory. Here's an example of the data structure:

data/
├── original/   # Raw data
│   ├── patients.csv.gz
│   ├── conditions.csv.gz
│   ├── observations.csv.gz
│   └── ...
└── processed/  # Cleaned data
    ├── clean_patients.csv
    ├── clean_conditions.csv
    ├── clean_observations.csv
    └── ...
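The mapping from a compressed raw file to its cleaned counterpart can be sketched as below. This is an illustration under assumptions: the helpers `processed_path` and `clean_file` are hypothetical, and dropping duplicate rows stands in for whatever cleaning the actual scripts perform.

```python
from pathlib import Path

import pandas as pd


def processed_path(raw_file: Path, processed_dir: Path) -> Path:
    """Map e.g. patients.csv.gz to <processed_dir>/clean_patients.csv."""
    table = raw_file.name.removesuffix(".gz").removesuffix(".csv")
    return processed_dir / f"clean_{table}.csv"


def clean_file(raw_file: Path, processed_dir: Path) -> Path:
    """Read one raw table, apply a placeholder cleaning step, and save it uncompressed."""
    # pandas infers gzip compression from the .gz extension.
    df = pd.read_csv(raw_file)
    # Placeholder cleaning step (an assumption): drop exact duplicate rows.
    df = df.drop_duplicates()
    processed_dir.mkdir(parents=True, exist_ok=True)
    out = processed_path(raw_file, processed_dir)
    df.to_csv(out, index=False)
    return out
```

For example, `clean_file(Path("data/original/patients.csv.gz"), Path("data/processed"))` would write `data/processed/clean_patients.csv`.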

Contributing

If you'd like to improve the analysis, suggest new features, or fix bugs, feel free to fork the repository and create a pull request.

How to Contribute

  • Fork the repository.

  • Create a feature branch (git checkout -b feature-branch).

  • Commit your changes (git commit -am 'Add new feature').

  • Push to the branch (git push origin feature-branch).

  • Create a new Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Babak Mahdavi Ardestani

babak.m.ardestani@gmail.com
