Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch

Table of Contents

  • Introduction
  • System Architecture
  • Technologies
  • Getting Started

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline with a TCP/IP socket, Apache Spark, OpenAI's LLM, Kafka and Elasticsearch. It covers each stage, from data acquisition and processing through sentiment analysis with ChatGPT to producing the results to a Kafka topic and indexing them in Elasticsearch.
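
The core of that pipeline is a Spark Structured Streaming job that reads review records from the socket, asks the LLM for a sentiment label and publishes the enriched records to Kafka. The snippet below is only a minimal sketch of that stage, not the repository's actual job: the host, port, schema fields, topic name and model are illustrative assumptions, and the OpenAI client expects an OPENAI_API_KEY environment variable.

    # Minimal sketch: socket source -> LLM sentiment -> Kafka sink (illustrative names only).
    from openai import OpenAI
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, struct, to_json, udf
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("yelp-review-sentiment").getOrCreate()

    # Hypothetical schema for the incoming Yelp review records.
    schema = StructType([
        StructField("review_id", StringType()),
        StructField("text", StringType()),
    ])

    # Read raw JSON lines streamed over the TCP socket.
    raw = (spark.readStream
           .format("socket")
           .option("host", "localhost")
           .option("port", 9999)
           .load())
    reviews = raw.select(from_json(col("value"), schema).alias("review")).select("review.*")

    def classify(text: str) -> str:
        """Ask the LLM for a one-word sentiment label."""
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": f"Classify this review as POSITIVE, NEGATIVE or NEUTRAL:\n{text}"}],
        )
        return response.choices[0].message.content.strip()

    sentiment_udf = udf(classify, StringType())
    enriched = reviews.withColumn("sentiment", sentiment_udf(col("text")))

    # Publish the enriched records to a Kafka topic as JSON.
    query = (enriched.select(to_json(struct(*enriched.columns)).alias("value"))
             .writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "customers_review")
             .option("checkpointLocation", "/tmp/checkpoints/reviews")
             .start())
    query.awaitTermination()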

System Architecture

System architecture diagram: System_architecture.png

The project is designed with the following components:

  • Data Source: We use the yelp.com dataset for our pipeline.
  • TCP/IP Socket: Used to stream the data over the network in chunks (see the sketch after this list).
  • Apache Spark: For data processing with its master and worker nodes.
  • Confluent Kafka: Our Kafka cluster, hosted in the cloud.
  • Control Center and Schema Registry: Help with monitoring and schema management of our Kafka streams.
  • Kafka Connect: For connecting the Kafka topics to Elasticsearch.
  • Elasticsearch: For indexing and querying the processed data.
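
As an illustration of the socket stage, the following sketch streams the Yelp reviews file over TCP one JSON line at a time; the file path, host, port and throttle interval are assumptions for illustration, not values taken from this repository.

    # Minimal TCP server that streams the Yelp dataset line by line (illustrative path/port).
    import json
    import socket
    import time

    def stream_file(path: str, host: str = "0.0.0.0", port: int = 9999) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            server.bind((host, port))
            server.listen(1)
            print(f"Listening on {host}:{port} ...")
            conn, addr = server.accept()
            print(f"Client connected: {addr}")
            with conn, open(path, encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)  # one Yelp review per line
                    conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
                    time.sleep(0.5)  # throttle to simulate a live stream

    if __name__ == "__main__":
        stream_file("data/yelp_academic_dataset_review.json")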

Technologies

  • Python
  • TCP/IP
  • Confluent Kafka
  • Apache Spark
  • Docker
  • Elasticsearch

Getting Started

  1. Clone the repository:

    git clone https://github.com/FroCode/Real_Streaming_Kafka.git
  2. Navigate to the project directory:

    cd Real_Streaming_Kafka
  3. Run Docker Compose to spin up the Spark cluster:

    docker-compose up
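
Once the containers are up, the streaming job would typically be submitted to the Spark master running inside the cluster. The command below is only an example of what that might look like; the container name, script path and package version are assumptions, so adjust them to match your docker-compose file and the repository layout:

    docker exec -it spark-master spark-submit \
      --master spark://spark-master:7077 \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
      jobs/spark_streaming.py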
