This project is a comprehensive guide to building an end-to-end data engineering pipeline with TCP/IP sockets, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers every stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing into Elasticsearch.
The project is designed with the following components:
- Data Source: We use the Yelp dataset (from yelp.com) for our pipeline.
- TCP/IP Socket: Streams the data over the network in chunks (see the socket sketch after this list).
- Apache Spark: Processes the data with its master and worker nodes (a streaming-job sketch follows).
- Confluent Kafka: Our Kafka cluster in the cloud.
- Control Center and Schema Registry: Monitoring and schema management for the Kafka streams.
- Kafka Connect: Connects the Kafka topic to Elasticsearch (a connector-registration sketch follows).
- Elasticsearch: Indexes and queries the processed data.
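For context, here is a minimal sketch of what the TCP/IP streaming side can look like: a socket server that reads the Yelp reviews file and sends newline-delimited JSON records over the network in small chunks. The host, port, file path, and chunk size are illustrative assumptions, not values taken from this repository.

```python
# Hedged sketch: stream Yelp review records over a TCP socket in chunks.
import json
import socket

def stream_yelp_reviews(host="localhost", port=9999,
                        path="data/yelp_academic_dataset_review.json",  # assumed path
                        chunk_size=2):
    """Accept one client and send `chunk_size` JSON records at a time."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()
        with conn, open(path, encoding="utf-8") as f:
            batch = []
            for line in f:
                batch.append(json.loads(line))
                if len(batch) == chunk_size:
                    for record in batch:
                        # newline-delimited JSON, so Spark's socket source
                        # can split records on '\n'
                        conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
                    batch.clear()

if __name__ == "__main__":
    stream_yelp_reviews()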
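The Spark stage can then consume that socket, classify each review's sentiment with an OpenAI chat model, and produce the result to Kafka. The sketch below assumes a `yelp_reviews` topic, a two-field schema, the `gpt-3.5-turbo` model, and a local broker; none of these names are taken from the repo.

```python
# Hedged sketch: socket -> sentiment via OpenAI -> Kafka.
# Requires the spark-sql-kafka-0-10 package (e.g. spark-submit --packages ...).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, udf, to_json, struct
from pyspark.sql.types import StructType, StructField, StringType

def sentiment(text):
    """Call OpenAI once per review; returns POSITIVE/NEGATIVE/NEUTRAL."""
    from openai import OpenAI  # imported inside the UDF so executors pick it up
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=[{
            "role": "user",
            "content": ("Classify the sentiment of this review as "
                        "POSITIVE, NEGATIVE or NEUTRAL. Reply with one word.\n"
                        f"{text}"),
        }],
    )
    return reply.choices[0].message.content.strip()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("YelpSentiment").getOrCreate()

    # Assumed two-field review schema for illustration.
    schema = StructType([
        StructField("review_id", StringType()),
        StructField("text", StringType()),
    ])
    sentiment_udf = udf(sentiment, StringType())

    reviews = (spark.readStream.format("socket")
               .option("host", "localhost").option("port", 9999).load()
               .select(from_json(col("value"), schema).alias("r"))
               .select("r.*")
               .withColumn("sentiment", sentiment_udf(col("text"))))

    query = (reviews
             .select(col("review_id").alias("key"),
                     to_json(struct("*")).alias("value"))
             .writeStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
             .option("topic", "yelp_reviews")                      # assumed topic
             .option("checkpointLocation", "/tmp/checkpoint")
             .start())
    query.awaitTermination()
```

Note that a UDF like this makes one API call per row, which is slow and costly at scale; it is shown only to make the data flow concrete.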
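On the Kafka Connect side, the Elasticsearch sink is configured by POSTing a connector definition to Connect's REST API. Here is a hedged example using Confluent's Elasticsearch sink connector; the hosts, ports, and topic name are assumptions to adjust for your cluster.

```python
# Hedged sketch: register an Elasticsearch sink connector with Kafka Connect.
import json
import requests

connector = {
    "name": "elasticsearch-sink",  # assumed connector name
    "config": {
        # Confluent's Elasticsearch sink connector
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "yelp_reviews",                    # topic produced by the Spark job
        "connection.url": "http://elasticsearch:9200",  # assumed ES address
        "key.ignore": "true",                        # let ES derive document ids
        "schema.ignore": "true",                     # index plain JSON values
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}

resp = requests.post("http://localhost:8083/connectors",  # assumed Connect address
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
print(resp.json())
```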
The project uses the following technologies:

- Python
- TCP/IP
- Confluent Kafka
- Apache Spark
- Docker
- Elasticsearch
- Clone the repository:
  `git clone https://github.com/FroCode/Real_Streaming_Kafka.git`
- Navigate to the project directory:
  `cd Real_Streaming_Kafka`
- Run Docker Compose to spin up the Spark cluster:
  `docker-compose up`
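Once the containers are up, a quick way to verify the pipeline end to end is to search the Elasticsearch index the sink connector writes to (by default the Confluent sink names the index after the Kafka topic). The index name and port below are assumptions matching the earlier sketches.

```python
# Hedged check: fetch a few positively classified reviews from Elasticsearch.
import requests

resp = requests.get(
    "http://localhost:9200/yelp_reviews/_search",  # assumed index and port
    json={"query": {"match": {"sentiment": "POSITIVE"}}, "size": 3},
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```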
