Kafka is a distributed event streaming platform; it is more than a messaging system and serves as the central hub of an integration architecture.

Apache Kafka is an open-source distributed streaming platform known for its high-throughput, fault-tolerant, and scalable data streaming capabilities.

Let's explore the key components of Kafka's architecture in a nutshell:

Producers: Producers are responsible for publishing data records to Kafka topics. They can be applications or systems that generate data and send it to Kafka for processing.
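
A minimal producer sketch in Java (the broker address localhost:9092 and the topic name "orders" are placeholders for illustration, not part of any particular setup):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the (hypothetical) "orders" topic.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}"));
        } // close() flushes any pending records
    }
}
```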

Topics: Topics are categories or feeds to which data records are published by producers. Topics are organized into partitions, allowing data to be distributed and processed in parallel.

Partitions: Each topic is divided into multiple partitions. Partitions enable horizontal scalability and parallel processing within a Kafka cluster. They ensure that data within a topic is spread across multiple brokers for increased throughput.
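
For example, a topic can be created with an explicit partition count and replication factor via the AdminClient API (topic name, partition count, and replication factor below are illustrative only):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions enable parallel processing; replication factor 3
            // spreads copies of each partition across three brokers.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get(); // block until created
        }
    }
}
```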

Brokers: Brokers are the Kafka servers in the cluster that handle the storage and replication of data. Each broker manages one or more partitions and communicates with producers and consumers.

Consumers: Consumers are applications or systems that subscribe to Kafka topics and consume the data records. They read data from the partitions and process it based on their specific requirements.

Consumer Groups: Consumer groups are a way to scale out consumption and achieve load balancing. Consumers within a group coordinate to consume different partitions of a topic, ensuring parallel processing and fault tolerance.
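
A sketch of a consumer joining a group; start several instances with the same group.id and Kafka splits the topic's partitions among them (all names here are placeholders):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-processors");        // consumers sharing this id form one group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders")); // partitions are assigned automatically within the group
        // ...poll loop shown later in this page
    }
}
```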

ZooKeeper: In older versions of Kafka, ZooKeeper was used for managing and maintaining cluster metadata, including broker and consumer group information. Starting with Apache Kafka 2.8, however, Kafka introduced KRaft, its own internal metadata management system, as a replacement; KRaft became production-ready in Kafka 3.3, removing the ZooKeeper dependency entirely.

Kafka Cluster: A Kafka cluster consists of multiple brokers working together to handle data streams. The cluster provides fault tolerance and scalability, allowing data to be replicated across brokers for high availability.

Connectors and Streams: Apache Kafka offers additional components like Kafka Connect and Kafka Streams. Kafka Connect simplifies integration with external data systems, enabling efficient data ingestion and extraction. Kafka Streams provides a high-level API for building stream processing applications on top of Kafka.
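
A minimal Kafka Streams topology, assuming input/output topic names that are purely illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-uppercaser"); // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(v -> v.toUpperCase()) // trivial transformation for illustration
              .to("orders-uppercased");

        new KafkaStreams(builder.build(), props).start();
    }
}
```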

Kafka

  • is an event ledger, keeping track of all the messages that come in
  • is distributed in nature
  • is a redundant system
  • uses Messaging System Semantics (i.e., it functions much like a traditional messaging system)
  • treats Clustering as a core principle, employing multiple nodes to distribute the load
  • ensures Durability & Ordering Guarantees (see the config sketch below)
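
The durability guarantee can be tightened on the producer side; a config sketch (these values are a starting point, not a prescription):

```java
// Added to the producer Properties from the earlier sketch.
props.put("acks", "all");                // wait for all in-sync replicas to acknowledge
props.put("enable.idempotence", "true"); // retries won't create duplicate records
```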

Use cases

Kafka can be employed for use cases such as the following:

  • Asynchronous processing (where synchronous integration is hard)
  • Scaling ETL jobs / data pipelines / big data ingest
  • Error-prone processing (e.g., parsing logic might throw exceptions due to invalid payload data)
  • Event Store (to go back, replay events, and retry certain operations; see the sketch after this list)
  • Distributed processing
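
Replaying from the event store can be as simple as rewinding a consumer's offsets. A sketch reusing the consumer from the sketch above (handleAgain is a hypothetical retry handler, not a Kafka API):

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Join the group so partitions get assigned (may take more than one poll),
// then rewind to the earliest available offset in every assigned partition.
consumer.poll(Duration.ZERO);
consumer.seekToBeginning(consumer.assignment());

while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
        handleAgain(rec); // hypothetical: retry the operation that previously failed
    }
}
```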

Why Kafka?

Ordering

It is important that messages are delivered in sequential order; for example, an order must be created before it is updated, never the other way around.
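
Kafka guarantees ordering within a partition, and records with the same key always land in the same partition. Reusing the producer from the earlier sketch (topic and key are placeholders):

```java
// Same key => same partition => the "created" event is always stored before "updated".
producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_CREATED"));
producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_UPDATED"));
```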

Horizontal Scaling

Push vs. Pub/Sub

Operations

Kafka operations are record-based; each record has the following properties:

  • Key, Value, Timestamp
  • Immutable
  • Append Only
  • Persisted

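Here is how those pieces map onto the producer API, using the full ProducerRecord constructor (all values are placeholders):

```java
ProducerRecord<String, String> rec = new ProducerRecord<>(
        "orders",                   // topic
        null,                       // partition: null lets the partitioner decide from the key
        System.currentTimeMillis(), // timestamp
        "order-42",                 // key
        "{\"status\":\"created\"}"  // value
);
```
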
3 Components

  • Broker: Node in the cluster
  • Producer: Writes the records to a broker
  • Consumer: Reads records from a broker

Kafka does not push records to consumers; instead, consumers connect to brokers and ask for records.
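
A minimal poll loop, continuing the consumer sketch from earlier:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// The consumer drives the flow: each poll() asks the broker for new records.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    records.forEach(rec ->
            System.out.printf("offset=%d key=%s value=%s%n", rec.offset(), rec.key(), rec.value()));
}
```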
