This project implements a comprehensive Big Data architecture to predict pandemic risk levels, focusing on COVID-19 data analysis. The system processes historical COVID-19 data, trains a machine learning model, and provides real-time risk predictions through an interactive dashboard.
The solution leverages several key Big Data technologies:
- Storage: Hadoop HDFS (partitioned Parquet files)
- Processing: Apache Spark for batch processing and machine learning
- Streaming: Kafka and Spark Streaming for real-time data pipelines
- Database: PostgreSQL for prediction storage
- Visualization: Streamlit dashboard and Grafana monitoring
Key features:
- Conversion of CSV data to optimized Parquet format with time-based partitioning
- Machine learning model (RandomForest) for risk classification
- Real-time data streaming pipeline with Kafka
- Interactive dashboards for risk visualization
- Geographic risk distribution with choropleth maps
- Time-series analysis of pandemic trends
Project components:
- csv_to_parquet.py: Converts raw COVID-19 CSV data to partitioned Parquet format in HDFS
- risk_model_training.py: Trains and saves a RandomForest classification model for risk prediction
- risk_kafka_producer.py: Reads data from HDFS and streams it to the Kafka topic "risk_data"
- postgre_consumer.py: Consumes the data stream, applies the ML model, and stores predictions in PostgreSQL
- streamlit_dashboard.py: Interactive web dashboard for data exploration and visualization
- Grafana dashboards for monitoring and analytics
The project uses US COVID-19 data from 2023 with the following structure:
date, county, state, cases, deaths
Data is processed and augmented with risk scores and categories.
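The risk scores are not part of the raw dataset; they are derived during processing. A minimal sketch of how a risk category could be assigned from daily new cases is shown below; the thresholds and column names are illustrative assumptions, not the project's exact logic.

```python
# Illustrative sketch: derive a risk category from daily new cases per county.
# Thresholds are hypothetical; the project's scripts define the actual scoring.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("risk_scoring_sketch").getOrCreate()
df = spark.read.csv("hdfs:///data/pandemics/us-covid_19-2023.csv",
                    header=True, inferSchema=True)

w = Window.partitionBy("state", "county").orderBy("date")
df = df.withColumn("new_cases", F.col("cases") - F.lag("cases", 1).over(w))

# Illustrative buckets: 0 = low, 1 = medium, 2 = high
df = df.withColumn(
    "risk_category",
    F.when(F.col("new_cases") > 500, 2)
     .when(F.col("new_cases") > 100, 1)
     .otherwise(0),
)
df.show(5)
```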
- Apache Hadoop
- Apache Spark
- Apache Kafka
- PostgreSQL
- Python 3.x with required packages (pyspark, kafka-python, streamlit, pandas, plotly)
- Clone the repository:
git clone https://github.com/Houssam-11/BigData-Architecture.git
cd BigData-Architecture
- Set up your Hadoop environment:
hdfs dfs -mkdir -p /data/pandemics
hdfs dfs -mkdir -p /models
hdfs dfs -mkdir -p /checkpoints/pandemic_v2
- Upload your COVID-19 data:
hdfs dfs -put us-covid_19-2023.csv /data/pandemics/
- Install Python dependencies:
pip install pyspark kafka-python streamlit pandas plotly psycopg2-binary us
- Set up PostgreSQL database:
CREATE DATABASE pandemic_db;
CREATE USER spark_user WITH PASSWORD '1234';
GRANT ALL PRIVILEGES ON DATABASE pandemic_db TO spark_user;
\c pandemic_db
CREATE TABLE risk_predictions (
state TEXT,
county TEXT,
date DATE,
risk_category INTEGER,
predicted_risk_category INTEGER
);
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO spark_user;
- Process CSV data to Parquet:
spark-submit csv_to_parquet.py
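A minimal sketch of what this conversion does, assuming year/month partition columns derived from the date field (the paths and partition scheme are assumptions; csv_to_parquet.py in the repository is authoritative):

```python
# Sketch of the CSV-to-Parquet conversion with time-based partitioning.
# Paths and partition columns are assumptions; see csv_to_parquet.py for the actual logic.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

df = (spark.read.csv("hdfs:///data/pandemics/us-covid_19-2023.csv",
                     header=True, inferSchema=True)
      .withColumn("date", F.to_date("date")))

(df.withColumn("year", F.year("date"))
   .withColumn("month", F.month("date"))
   .write.mode("overwrite")
   .partitionBy("year", "month")
   .parquet("hdfs:///data/pandemics/parquet"))
```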
- Train the risk prediction model:
spark-submit risk_model_training.py
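A sketch of the training step with Spark MLlib, assuming cases and deaths as feature columns and risk_category as the label; the actual feature set and paths in risk_model_training.py may differ:

```python
# Sketch of RandomForest training with Spark MLlib; feature columns, label column,
# and the model path are assumptions about risk_model_training.py.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("risk_model_training").getOrCreate()
df = spark.read.parquet("hdfs:///data/pandemics/parquet")  # assumes a risk_category column exists

train, test = df.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=["cases", "deaths"], outputCol="features")
rf = RandomForestClassifier(labelCol="risk_category", featuresCol="features", numTrees=50)
model = Pipeline(stages=[assembler, rf]).fit(train)

accuracy = MulticlassClassificationEvaluator(
    labelCol="risk_category", metricName="accuracy"
).evaluate(model.transform(test))
print(f"test accuracy = {accuracy:.3f}")

model.write().overwrite().save("hdfs:///models/risk_rf_model")
```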
- Start Kafka and create necessary topics:
kafka-topics.sh --create --topic risk_data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
- Run the Kafka producer:
python risk_kafka_producer.py
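A sketch of the producer side with kafka-python, assuming JSON-encoded records read from the Parquet data; collecting to the driver keeps the example short, and the real script may batch or rate-limit the stream:

```python
# Sketch of the producer: publish processed records as JSON to the "risk_data" topic.
import json
from kafka import KafkaProducer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("risk_kafka_producer").getOrCreate()
rows = spark.read.parquet("hdfs:///data/pandemics/parquet").collect()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

for row in rows:
    producer.send("risk_data", row.asDict())

producer.flush()
```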
- Run the Spark Streaming consumer:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.postgresql:postgresql:42.2.27 postgre_consumer.py
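A sketch of the consumer: parse JSON from Kafka, score each micro-batch with the saved model, and append predictions to PostgreSQL via JDBC. The schema, paths, and credentials mirror the setup above but remain assumptions about postgre_consumer.py:

```python
# Sketch of the streaming consumer: Kafka -> ML model -> PostgreSQL.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("postgre_consumer").getOrCreate()

schema = (StructType()
          .add("date", StringType()).add("county", StringType()).add("state", StringType())
          .add("cases", IntegerType()).add("deaths", IntegerType())
          .add("risk_category", IntegerType()))

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "risk_data")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*")
          .withColumn("date", F.to_date("date")))

model = PipelineModel.load("hdfs:///models/risk_rf_model")

def write_batch(batch_df, batch_id):
    # Score the micro-batch and keep only the columns in the risk_predictions table.
    preds = (model.transform(batch_df)
             .withColumn("predicted_risk_category", F.col("prediction").cast("int"))
             .select("state", "county", "date", "risk_category", "predicted_risk_category"))
    (preds.write.format("jdbc")
     .option("url", "jdbc:postgresql://localhost:5432/pandemic_db")
     .option("dbtable", "risk_predictions")
     .option("user", "spark_user").option("password", "1234")
     .option("driver", "org.postgresql.Driver")
     .mode("append").save())

(stream.writeStream.foreachBatch(write_batch)
 .option("checkpointLocation", "hdfs:///checkpoints/pandemic_v2")
 .start().awaitTermination())
```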
- Launch the Streamlit dashboard:
streamlit run streamlit_dashboard.py
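A minimal sketch of a Streamlit view over the predictions table; the query, connection details, and choropleth below are illustrative rather than the project's actual dashboard layout:

```python
# Sketch of a minimal dashboard over the risk_predictions table.
import pandas as pd
import plotly.express as px
import psycopg2
import streamlit as st

conn = psycopg2.connect(host="localhost", dbname="pandemic_db",
                        user="spark_user", password="1234")
df = pd.read_sql("SELECT * FROM risk_predictions", conn)

st.title("COVID-19 Risk Predictions")
latest = df[df["date"] == df["date"].max()]
state_risk = latest.groupby("state", as_index=False)["predicted_risk_category"].mean()

# The choropleth expects two-letter state codes; if the data holds full state names,
# the `us` package can map them to abbreviations first.
fig = px.choropleth(state_risk, locations="state", locationmode="USA-states",
                    color="predicted_risk_category", scope="usa")
st.plotly_chart(fig)
```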
The final system provides:
- Risk classification with 96% accuracy
- Identification of high-risk pandemic zones
- Geographic visualization of risk distribution
- Time-based analysis of pandemic trends
Planned enhancements:
- Integration with external data sources (weather, population density)
- Enhanced prediction models with deep learning
- Mobile application for real-time alerts
- Deployment to cloud infrastructure for scalability