This repository contains a series of Python scripts and Jupyter notebooks for cleaning, processing, and analysing synthetic healthcare data generated by the Synthea simulator, with a focus on hypertension analysis. The project includes data cleaning, data validation, and statistical analysis related to blood pressure, BMI, and hypertension prevalence.
The project is organised as follows:
├── README.md # Project overview, setup, & usage
├── synthea_data-analysis.ipynb # Integrated notebook
├── requirements.txt # Python dependencies
├── .gitignore # Ignoring data dumps, etc.
├── data/
│ ├── original/ # Raw Synthea data (input data)
│ └── processed/ # Cleaned outputs from scripts
├── docs/
│ └── data_dictionary.md # Data dictionary for reference
├── archive/ # Archived scripts and notebooks
│ ├── scripts/ # Python scripts
│ │ ├── 01_patient_cleaning.py
│ │ ├── 02_conditions_cleaning.py
│ │ ├── 03_observations_cleaning.py
│ │ ├── 04_medications_cleaning.py
│ │ ├── 05_encounters_cleaning.py
│ │ ├── 06_data_desc.py
│ │ ├── 07_hypertension_bp_bmi_analysis.py
│ │ ├── 08_compare_bp_bmi_hypertensive_vs_non.py
│ │ └── 09_hypertension_prevalence.py
│ └── notebooks/ # Jupyter notebooks
This repository focuses on cleaning and analysing the synthetic healthcare data produced by the Synthea simulator. The analysis primarily examines hypertension-related data, including blood pressure and BMI metrics.
- Data Cleaning: The raw Synthea data is cleaned in a series of scripts, starting with patient data and continuing through conditions, observations, medications, and encounters.
- Data Analysis: Once the data is cleaned, the project performs statistical analysis on key indicators like hypertension prevalence, blood pressure (BP), and BMI across different patient populations.
- Reporting & Visualisation: The final results are summarised in reports, including figures and tables generated during analysis.
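As a rough illustration of the prevalence indicator mentioned above, the sketch below computes the share of patients with at least one hypertension-related condition. It is not the code from 09_hypertension_prevalence.py. The function name `hypertension_prevalence`, the column names (`Id`, `PATIENT`, `DESCRIPTION`), and the keyword match are assumptions made for the example, and a keyword match is crude (it would also count descriptions such as "prehypertension").

```python
import pandas as pd


def hypertension_prevalence(patients, conditions, keyword="hypertension"):
    """Return the fraction of patients with a condition matching `keyword`.

    Assumes `patients` has an `Id` column and `conditions` has `PATIENT`
    and `DESCRIPTION` columns (hypothetical names for this sketch).
    """
    # Flag condition rows whose description mentions the keyword
    is_match = conditions["DESCRIPTION"].str.contains(keyword, case=False, na=False)
    matched_ids = set(conditions.loc[is_match, "PATIENT"])

    # Count each patient once, even if the patient table has repeated IDs
    unique_ids = patients["Id"].drop_duplicates()
    if unique_ids.empty:
        return 0.0
    return float(unique_ids.isin(matched_ids).mean())


# Toy data for illustration only; not taken from Synthea output
toy_patients = pd.DataFrame({"Id": ["p1", "p2", "p3", "p4"]})
toy_conditions = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p2", "p3"],
        "DESCRIPTION": [
            "Hypertension",
            "Hypertension",
            "Diabetes",
            "Essential hypertension (disorder)",
        ],
    }
)
toy_prevalence = hypertension_prevalence(toy_patients, toy_conditions)
```

With the toy tables, two of the four patients have a matching condition, so `toy_prevalence` is 0.5.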
To get started, set up the environment using pip. First, clone the repository:
git clone https://github.com/babak2/synthea_data-analysis.git
cd synthea_data-analysis
Then, install the required dependencies:
pip install -r requirements.txt
The project requires the following key Python libraries:
- pandas: For data manipulation and cleaning
- numpy: For numerical operations
- matplotlib and seaborn: For data visualisation
- jupytext: To work seamlessly with Jupyter notebooks and scripts
For a full list of dependencies, check out the requirements.txt file.
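After installing, a quick check can confirm that the key libraries are importable. This is a minimal sketch; the helper name `missing_packages` and the `REQUIRED` list are made up for the example, and requirements.txt remains the authoritative list.

```python
from importlib.util import find_spec

# Key libraries named in this README (assumed import names)
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "jupytext"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_packages(REQUIRED)` returns an empty list when everything is installed, or the names that still need installing.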
The repository contains Python scripts that can be executed independently or together in sequence. Here's how you can run them:
- Run individual Python scripts: The scripts are numbered in their intended execution order. You can run any of them individually using Python:
python archive/scripts/01_patient_cleaning.py
python archive/scripts/02_conditions_cleaning.py
... and so on for each script
- Execute the integrated Jupyter notebook: The final analysis is contained in the synthea_data-analysis.ipynb notebook. Open it with Jupyter and run all cells to execute the entire analysis in one go:
jupyter notebook synthea_data-analysis.ipynb
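To run the numbered scripts in order without typing each command, a small driver along the following lines could be used. This is a sketch, not part of the repository: the function names `find_pipeline_scripts` and `run_pipeline` are hypothetical, and it assumes every script follows the two-digit prefix naming shown in the project tree.

```python
import subprocess
import sys
from pathlib import Path


def find_pipeline_scripts(script_dir):
    """Return scripts named like 01_name.py, sorted into execution order."""
    scripts = Path(script_dir).glob("[0-9][0-9]_*.py")
    return sorted(scripts, key=lambda path: path.name)


def run_pipeline(script_dir="archive/scripts"):
    """Run each numbered script in order, stopping at the first failure."""
    for script in find_pipeline_scripts(script_dir):
        print(f"Running {script.name} ...")
        # check=True raises CalledProcessError if the script exits non-zero
        subprocess.run([sys.executable, str(script)], check=True)
```

Calling `run_pipeline()` from the repository root would execute the scripts from 01 onward with the same interpreter that runs the driver.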
The raw Synthea data files can be placed in the data/original/ directory. After running the cleaning scripts, the processed data will be saved in the data/processed/ directory. Here's an example of the data structure:
data/
├── original/ # Raw data
│ ├── patients.csv.gz
│ ├── conditions.csv.gz
│ ├── observations.csv.gz
│ └── ...
└── processed/ # Cleaned data
  ├── clean_patients.csv
  ├── clean_conditions.csv
  ├── clean_observations.csv
  └── ...
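The flow from data/original/ to data/processed/ can be sketched as follows. This is an illustration under stated assumptions, not the logic of 01_patient_cleaning.py: the function names are hypothetical, the `Id` and `BIRTHDATE` column names are assumed, and the two cleaning steps (dropping duplicate IDs and unparseable birth dates) are placeholders for whatever the real script does.

```python
import pandas as pd


def clean_patients(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning: drop duplicate IDs and rows with bad birth dates."""
    cleaned = raw.drop_duplicates(subset="Id").copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising
    cleaned["BIRTHDATE"] = pd.to_datetime(cleaned["BIRTHDATE"], errors="coerce")
    return cleaned.dropna(subset=["BIRTHDATE"]).reset_index(drop=True)


def process_patients(src="data/original/patients.csv.gz",
                     dst="data/processed/clean_patients.csv"):
    """Read the gzipped raw file, clean it, and write a plain CSV."""
    raw = pd.read_csv(src)  # gzip compression is inferred from the .gz suffix
    clean_patients(raw).to_csv(dst, index=False)
```

Keeping the cleaning logic in a function that takes and returns a DataFrame makes it easy to test on a small in-memory table before touching the files.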
If you'd like to improve the analysis, suggest new features, or fix bugs, feel free to fork the repository and create a pull request.
How to Contribute
- Fork the repository.
- Create a feature branch (git checkout -b feature-branch).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Create a new Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Babak Mahdavi Ardestani