This project is a complete, end-to-end data engineering pipeline that demonstrates a real-world data processing architecture. It captures live air quality data from major Indian cities, processes it in real-time using Apache Spark, and stores the historical trends in a PostgreSQL database. The final, cleaned data is presented in an interactive Streamlit dashboard.
The final output is a clean, auto-refreshing dashboard that displays the latest air quality metrics and visualizes historical trends for each city.
The Spark UI shows the live processing of data batches as they arrive from the Kafka stream.
The pipeline follows a modern, decoupled, and scalable architecture:
Live API -> Kafka Producer -> Kafka Topic -> Spark Streaming (PySpark) -> PostgreSQL Database -> Streamlit Dashboard
- A Python Producer continuously fetches data from the World Air Quality Index API for multiple cities.
- This data is sent as JSON messages to an Apache Kafka topic, which acts as a durable, real-time message bus (an example message is shown after this list).
- An Apache Spark streaming job, running locally, connects to the Kafka topic, consumes the data in micro-batches, and performs real-time transformations and aggregations (calculating average AQI, determining health status).
- The processed, insightful data is then appended to a PostgreSQL database, creating a historical log of air quality trends.
- A Streamlit web application queries the PostgreSQL database and presents the data in an interactive, user-friendly dashboard that auto-refreshes.
- The entire backend infrastructure (Kafka, Zookeeper, PostgreSQL) is containerized and managed with Docker and Docker Compose for easy setup and portability.
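As a concrete illustration, a single message on the Kafka topic might look like the Python dictionary below. The field names (`city`, `aqi`, `timestamp`) are assumptions for illustration, not necessarily the producer's exact schema:

```python
# Hypothetical shape of one JSON message published to the Kafka topic.
sample_message = {
    "city": "delhi",
    "aqi": 182,               # World Air Quality Index value for the city
    "timestamp": 1717400000,  # Unix epoch seconds when the reading was fetched
}
```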
- Docker and Docker Compose must be installed.
- You need a free API key from the World Air Quality Index Project.
Step 1: Clone the Repository and Add Your API Key
- Clone this repository to your local machine.
- Open the kafka_producers/air_quality_producer.py script.
- Inside the script, replace the placeholder API key with your actual key (see the illustrative snippet after this list).
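For illustration only (the actual variable name inside the script may differ), this edit usually amounts to a single line near the top of the producer:

```python
# kafka_producers/air_quality_producer.py -- hypothetical excerpt
# Replace the placeholder with the token issued by the World Air Quality Index Project.
API_TOKEN = "your-waqi-api-key-here"
```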
Step 2: Start the Backend Infrastructure
From the project's root directory, start the Kafka and PostgreSQL containers.
docker-compose up -d
Important: Wait for about 45-60 seconds after running this command to allow the services to fully initialize.
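If you would rather not rely on a fixed delay, a small polling script can confirm that the services are accepting connections before you continue. The ports below assume the default mappings (Kafka on 9092, PostgreSQL on 5432); adjust them if your docker-compose.yml differs:

```python
import socket
import time

def wait_for(host: str, port: int, timeout: int = 120) -> None:
    """Block until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"{host}:{port} is ready")
                return
        except OSError:
            time.sleep(2)
    raise TimeoutError(f"{host}:{port} did not become available within {timeout}s")

# Assumed default port mappings from docker-compose.yml.
wait_for("localhost", 9092)   # Kafka broker
wait_for("localhost", 5432)   # PostgreSQL
```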
Step 3: Start the Data Producer
This script will fetch data from the API and send it to Kafka.
Open a new terminal.
Install the required Python libraries:
pip install -r kafka_producers/requirements.txt
Run the producer script:
python kafka_producers/air_quality_producer.py
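In broad strokes, the producer polls the World Air Quality Index REST API for each tracked city and publishes the JSON payload to a Kafka topic. The sketch below is illustrative, not the repository's exact code: the topic name, city list, and field selection are assumptions, and it presumes the requests and kafka-python packages from requirements.txt:

```python
import json
import time

import requests
from kafka import KafkaProducer  # from the kafka-python package

API_TOKEN = "your-waqi-api-key-here"          # assumption: replace with your WAQI key
CITIES = ["delhi", "mumbai", "kolkata"]       # assumption: the script may track other cities
TOPIC = "air_quality"                         # assumption: the topic name may differ

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    for city in CITIES:
        # WAQI city feed endpoint: https://api.waqi.info/feed/{city}/?token=...
        resp = requests.get(
            f"https://api.waqi.info/feed/{city}/",
            params={"token": API_TOKEN},
            timeout=10,
        )
        data = resp.json().get("data", {})
        message = {"city": city, "aqi": data.get("aqi"), "timestamp": int(time.time())}
        producer.send(TOPIC, value=message)
        print(f"sent {message}")
    producer.flush()
    time.sleep(60)  # polling interval; the real script may use a different cadence
```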
Step 4: Start the Spark Processing Job
This script will read from Kafka, process the data, and save it to PostgreSQL.
Open a third terminal.
Install the required Python libraries:
pip install pyspark==3.4.1
Run the Spark Streaming script:
python spark/app/stream_air_quality_processor.py
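For orientation, a Spark Structured Streaming job of this kind typically reads the Kafka topic, parses the JSON payload, derives fields such as a health status, and appends each micro-batch to PostgreSQL over JDBC. The sketch below is simplified (it omits the averaging logic), and the topic name, table name, database credentials, and package versions are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

# Assumptions: topic "air_quality", table "air_quality_readings", and the local
# connection settings below may differ from the repository's actual configuration.
spark = (
    SparkSession.builder
    .appName("AirQualityStream")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,org.postgresql:postgresql:42.6.0",
    )
    .getOrCreate()
)

schema = StructType([
    StructField("city", StringType()),
    StructField("aqi", IntegerType()),
    StructField("timestamp", LongType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "air_quality")
    .load()
)

parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*")
    .withColumn(
        "health_status",
        F.when(F.col("aqi") <= 50, "Good")
         .when(F.col("aqi") <= 100, "Moderate")
         .otherwise("Unhealthy"),
    )
)

def write_batch(batch_df, batch_id):
    # Append each micro-batch to PostgreSQL over JDBC.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/airquality")
        .option("dbtable", "air_quality_readings")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save())

(parsed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/air_quality_checkpoint")
    .start()
    .awaitTermination())
```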
Step 5: Launch the Streamlit Dashboard
This is the final step to view the results.
Open a fourth terminal.
Install the required Python libraries:
pip install -r dashboard/requirements.txt
Run the Streamlit app:
streamlit run dashboard/app.py
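Conceptually, the dashboard queries the PostgreSQL table and renders the latest reading plus a historical chart per city. The minimal sketch below assumes a psycopg2-backed SQLAlchemy connection string and the table name "air_quality_readings", both of which may differ from the repository's actual configuration:

```python
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Assumption: connection details and table name may differ from the real dashboard.
ENGINE = create_engine("postgresql://postgres:postgres@localhost:5432/airquality")

st.title("India Air Quality Dashboard")

@st.cache_data(ttl=60)  # re-query the database at most once a minute
def load_data() -> pd.DataFrame:
    return pd.read_sql("SELECT * FROM air_quality_readings", ENGINE)

df = load_data()

city = st.selectbox("City", sorted(df["city"].unique()))
city_df = df[df["city"] == city].sort_values("timestamp")

st.metric("Latest AQI", int(city_df["aqi"].iloc[-1]))
st.line_chart(city_df.set_index("timestamp")["aqi"])
```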