Retail Data Streaming with Confluent Kafka

Overview

This project demonstrates a real-time data streaming pipeline built with Python, Pandas, and Confluent Kafka, developed in VS Code. A producer publishes retail data sourced from Kaggle to a Kafka cluster, and multiple consumers subscribe to the Kafka topic to consume the data.

Key features include:

  • Data Streaming: Real-time data flow from producer to consumers.
  • Dynamic Observations: Behavior changes observed when varying read strategies, adding or removing consumers, and changing the partition count.
  • Scalability: Designed for real-time streaming across a fixed number of partitions.

Project Features

  • Producer: Publishes retail data to Kafka topics.
  • Consumer: Consumes data from Kafka topics using a subscription model.
  • Partition Observations: Behavior analysis with a fixed partition count.
  • Concurrency: Supports multiple consumers.

Technologies Used

  • Python: Programming language for producer and consumer code.
  • Pandas: Library for data manipulation.
  • Kafka: Distributed messaging system for real-time data streaming.
  • Confluent Kafka: Managed Kafka platform used in this project.
  • VS Code: Code editor.
  • Dataset: Retail data sourced from Kaggle.

Prerequisites

To run this project, you need:

  1. Python (version 3.8 or higher).
  2. Access to a Confluent-managed Kafka cluster.
  3. confluent-kafka library installed:
    pip install confluent-kafka  
  4. pandas library installed:
    pip install pandas  
  5. Retail dataset from Kaggle.

Getting Started

Step 1: Setting Up Kafka Cluster

  1. Install Confluent Kafka and set up a Kafka cluster.
  2. Create a topic in Kafka (e.g., retail-data); a programmatic sketch follows below.
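
The topic can be created from the Confluent Cloud console, or programmatically. Here is a minimal sketch using the confluent-kafka AdminClient; the bootstrap server and credentials are placeholders, and the partition count of 6 is an arbitrary example, not the project's actual setting:

from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder connection settings -- replace with your Confluent cluster values.
conf = {
    "bootstrap.servers": "<your-bootstrap-server>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
}

admin = AdminClient(conf)

# Create the topic with a fixed partition count (6 is an example value).
futures = admin.create_topics(
    [NewTopic("retail-data", num_partitions=6, replication_factor=3)]
)
for topic, future in futures.items():
    try:
        future.result()  # Raises if creation failed (e.g., topic already exists)
        print(f"Created topic {topic}")
    except Exception as e:
        print(f"Failed to create topic {topic}: {e}")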

Step 2: Configure the Producer and Consumer

  1. Update the topic name and Kafka server configuration in producer.py and consumer.py (illustrated below).
  2. Ensure the path to the dataset file (e.g., /retail_data.csv) is correctly referenced in producer.py.
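
Concretely, the pieces to update are the connection settings, the topic name, and the dataset path. The constant names below are illustrative, not necessarily the ones used in the repository's scripts:

# Illustrative constants; producer.py and consumer.py may organize these differently.
KAFKA_CONF = {
    "bootstrap.servers": "<your-bootstrap-server>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
}
TOPIC = "retail-data"
CSV_PATH = "retail_data.csv"  # Adjust to wherever the Kaggle dataset lives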

Step 3: Run the Producer

Execute the producer to start streaming data to the Kafka topic:

python producer.py  
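
producer.py in the repository is the reference implementation; as a rough sketch of the approach it takes, a producer that streams each CSV row to the topic as a JSON message could look like this (connection settings are placeholders):

import pandas as pd
from confluent_kafka import Producer

# Placeholder connection settings -- for Confluent Cloud you would also include
# the SASL_SSL options shown in Step 2.
producer = Producer({"bootstrap.servers": "<your-bootstrap-server>"})

def delivery_report(err, msg):
    # Invoked once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

df = pd.read_csv("retail_data.csv")  # Kaggle retail dataset from Step 2
for _, row in df.iterrows():
    # Stream each row as a JSON message; produce() is asynchronous.
    producer.produce("retail-data", value=row.to_json(), callback=delivery_report)
    producer.poll(0)  # Serve delivery callbacks without blocking

producer.flush()  # Block until every queued message has been delivered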

Step 4: Run the Consumer(s)

Start one or more consumers to subscribe to the topic and consume data:

python consumer.py  
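
Again as a hedged sketch rather than the repository's exact code, a consumer loop with confluent-kafka typically looks like this (group and topic names mirror the examples above):

from confluent_kafka import Consumer

conf = {
    "bootstrap.servers": "<your-bootstrap-server>",  # plus SASL settings for Confluent Cloud
    "group.id": "retail-consumers",   # Consumers sharing a group.id split the partitions
    "auto.offset.reset": "earliest",  # Read strategy when no committed offset exists
}

consumer = Consumer(conf)
consumer.subscribe(["retail-data"])

try:
    while True:
        msg = consumer.poll(1.0)  # Wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()  # Commit final offsets and leave the group cleanly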

Step 5: Observe the Behavior

  • Monitor data consumption and how it changes when:
    • Adding or removing consumers.
    • Adjusting read strategies in the consumer (see the settings sketch below).
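
The standard confluent-kafka settings to experiment with here are auto.offset.reset and group.id; the values shown are library options, not project-specific choices:

conf = {
    # "earliest" replays the topic from the beginning for a new group;
    # "latest" only delivers messages produced after the consumer joins.
    "auto.offset.reset": "earliest",

    # Consumers with the same group.id share the partitions (load balancing);
    # a consumer with a different group.id receives its own full copy of the stream.
    "group.id": "retail-consumers",
}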

Observations

  • Partition Number: The project uses a fixed partition count, since Kafka does not support freely resizing a topic's partitions after creation (partitions can be added, but never removed).
  • Consumer Behavior: Real-time insight into how Kafka rebalances partitions among consumers as they are added or removed.
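
For context on the fixed-partition constraint: because partitions can only ever be added, the count is effectively fixed downward, so the project picks one up front. A sketch of growing a topic with the admin API (the new total of 8 is arbitrary):

from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "<your-bootstrap-server>"})

# Partitions can only be increased, never decreased; this grows retail-data to 8.
futures = admin.create_partitions([NewPartitions("retail-data", 8)])
for topic, future in futures.items():
    try:
        future.result()
        print(f"{topic} partition count increased")
    except Exception as e:
        print(f"Failed to add partitions to {topic}: {e}")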
