This project demonstrates a real-time data streaming pipeline built with Python, Pandas, and Confluent Kafka, developed in VS Code. It uses a retail dataset sourced from Kaggle: a producer publishes the records to a topic on a Kafka cluster, and multiple consumers subscribe to that topic to consume them.
Key features include:
- Data Streaming: Real-time data flow from producer to consumers.
- Dynamic Observations: Changes observed based on read strategies, addition/deletion of consumers, and partition number.
- Scalability: Designed to handle real-time streaming with a fixed partition number.
- Producer: Publishes retail data to Kafka topics.
- Consumer: Consumes data from Kafka topics using a subscription model.
- Partition Observations: Behavior analysis with a fixed partition number.
- Concurrency: Supports multiple consumers.
- Python: Programming language for producer and consumer code.
- Pandas: Library for data manipulation.
- Kafka: Distributed messaging system for real-time data streaming.
- Confluent Kafka: Managed Kafka platform that hosts the cluster used for this project.
- VS Code: Code editor.
- Dataset: Retail data sourced from Kaggle.
To run this project, you need:
- Python (version 3.8 or higher).
- A Kafka cluster on Confluent's managed Kafka platform.
- confluent-kafka library installed:
pip install confluent-kafka
- pandas library installed:
pip install pandas
- Retail dataset from Kaggle.
- Install Confluent Kafka and set up a Kafka cluster.
- Create a topic in Kafka (e.g., `retail-data`).
- Update the topic name and Kafka server configurations in `producer.py` and `consumer.py`.
- Ensure the path to the dataset file (e.g., `/retail_data.csv`) is correctly referenced in `producer.py`.
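The configuration step above could be sketched as a small helper shared by both scripts. This is a minimal sketch, not the project's actual code: the function name `build_kafka_config` and the environment variable names are hypothetical, and the SASL settings assume a Confluent-managed cluster that authenticates with an API key and secret.

```python
import os


def build_kafka_config(bootstrap_servers, api_key=None, api_secret=None):
    """Return a client config dict usable by both producer.py and consumer.py."""
    config = {"bootstrap.servers": bootstrap_servers}
    if api_key and api_secret:
        # Assumption: the managed cluster uses API key/secret auth over SASL_SSL.
        config.update({
            "security.protocol": "SASL_SSL",
            "sasl.mechanisms": "PLAIN",
            "sasl.username": api_key,
            "sasl.password": api_secret,
        })
    return config


# Hypothetical environment variable names; adjust them to your own setup.
TOPIC = os.environ.get("KAFKA_TOPIC", "retail-data")
KAFKA_CONFIG = build_kafka_config(
    os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
    os.environ.get("KAFKA_API_KEY"),
    os.environ.get("KAFKA_API_SECRET"),
)
```

Reading credentials from the environment keeps the API secret out of the repository; the consumer would add its own `group.id` on top of this dict.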
Execute the producer to start streaming data to the Kafka topic:
python producer.py
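For orientation, a producer along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the contents of `producer.py`: the helper names `publish_records` and `delivery_report`, the JSON encoding, the optional `key_field`, and the `localhost:9092` address are all hypothetical. The Kafka and pandas imports sit inside `main()` so the helpers can be exercised without a cluster.

```python
import json


def delivery_report(err, msg):
    """Delivery callback, invoked once per message from poll() or flush()."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}] "
              f"at offset {msg.offset()}")


def publish_records(producer, topic, records, key_field=None):
    """Publish each record (a dict) to `topic` as a JSON-encoded message."""
    sent = 0
    for record in records:
        # Messages with the same key are routed to the same partition.
        key = str(record[key_field]) if key_field else None
        producer.produce(
            topic,
            key=key,
            value=json.dumps(record, default=str),
            callback=delivery_report,
        )
        producer.poll(0)  # serve delivery callbacks for earlier messages
        sent += 1
    producer.flush()  # block until every queued message has been delivered
    return sent


def main():
    # Imported here so the helpers above work without the Kafka client installed.
    import pandas as pd
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder
    df = pd.read_csv("/retail_data.csv")  # example path from the setup steps
    publish_records(producer, "retail-data", df.to_dict(orient="records"))
```

Calling `main()` streams the whole CSV once; adding a short `time.sleep` inside the loop would make the flow easier to watch from the consumer side.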
Start one or more consumers to subscribe to the topic and consume data:
python consumer.py
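A consumer built on the subscription model might be sketched as follows. As with any illustration here, this is not the actual `consumer.py`: the helper name `consume_records`, the group id `retail-consumers`, the JSON decoding, and the stop conditions are assumptions made for the example. The `on_assign` callback prints the partitions handed to this consumer, which makes rebalances visible when other consumers join or leave.

```python
import json


def on_assign(consumer, partitions):
    """Rebalance callback: shows which partitions this consumer now owns."""
    print("Assigned partitions:", sorted(p.partition for p in partitions))


def consume_records(consumer, topic, handle, max_messages=None, max_idle_polls=None):
    """Subscribe to `topic` and pass each decoded record to `handle`."""
    consumer.subscribe([topic], on_assign=on_assign)
    received = 0
    idle = 0
    try:
        while max_messages is None or received < max_messages:
            msg = consumer.poll(1.0)  # wait up to one second for a message
            if msg is None:
                idle += 1
                if max_idle_polls is not None and idle >= max_idle_polls:
                    break
                continue
            idle = 0
            if msg.error():
                print(f"Consumer error: {msg.error()}")
                continue
            handle(json.loads(msg.value()), msg.partition(), msg.offset())
            received += 1
    finally:
        consumer.close()  # leave the group so partitions are reassigned promptly
    return received


def main():
    # Imported here so the helper above works without the Kafka client installed.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder address
        "group.id": "retail-consumers",         # hypothetical group name
        "auto.offset.reset": "earliest",
    })
    consume_records(consumer, "retail-data",
                    lambda record, partition, offset: print(partition, offset, record))
```

Running `main()` in several terminals with the same `group.id` places the consumers in one group, so the topic's partitions are split among them.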
- Monitor data consumption and changes when:
  - Adding or removing consumers.
  - Adjusting read strategies in the consumer.
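The source does not spell out what a "read strategy" is. One plausible reading, assumed here, is the consumer's `auto.offset.reset` setting, which decides where a consumer group starts reading when it has no committed offset for a partition. The helper name `consumer_settings` is hypothetical.

```python
def consumer_settings(group_id, read_from="earliest"):
    """Build the consumer-specific settings for a given starting position.

    "earliest" replays the partition from its first retained message;
    "latest" only sees messages produced after the consumer joins.
    """
    if read_from not in ("earliest", "latest"):
        raise ValueError(f"unsupported read strategy: {read_from!r}")
    return {
        "group.id": group_id,
        "auto.offset.reset": read_from,
    }


# A new group id has no committed offsets, so the reset policy takes effect.
replay_all = consumer_settings("retail-replay", "earliest")
tail_only = consumer_settings("retail-tail", "latest")
```

The setting only applies when the group has no committed offset, so switching strategies on an existing group usually requires a fresh `group.id` to observe any difference.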
- Partition Number: The project uses a fixed partition number due to limitations on resizing a topic in Confluent Kafka (a Kafka topic's partition count can be increased but not decreased).
- Consumer Behavior: Real-time insights into how Kafka handles consumer addition and deletion.
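The effect of adding or removing consumers while the partition count stays fixed can be illustrated with a simplified model. The sketch below deals partitions out round-robin; it is not Kafka's actual assignor, and the function name and consumer names are hypothetical. It does reflect one property Kafka guarantees: within a consumer group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle.

```python
def assign_round_robin(num_partitions, consumers):
    """Simplified model: deal partitions out to consumers in turn."""
    assignment = {name: [] for name in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for partition in range(num_partitions):
        assignment[ordered[partition % len(ordered)]].append(partition)
    return assignment


# Three partitions, with the group growing from one to four consumers.
for members in (["c1"], ["c1", "c2"], ["c1", "c2", "c3", "c4"]):
    print(len(members), "consumer(s):", assign_round_robin(3, members))
# 1 consumer(s): {'c1': [0, 1, 2]}
# 2 consumer(s): {'c1': [0, 2], 'c2': [1]}
# 4 consumer(s): {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

In this model the fourth consumer receives nothing, which matches what one would expect to observe when starting more consumers than the topic has partitions.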