Blog post about Kafka Queue #509


Conversation

tinaselenge

@tinaselenge tinaselenge commented Jun 24, 2025

Type of change

  • Typo/minor fix
  • New blog post (see the README for the process)
  • Other

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge tinaselenge marked this pull request as ready for review June 24, 2025 15:09
Member

@scholzj scholzj left a comment

Looks nice. I left a few comments.

Comment on lines +132 to +135
```
group.coordinator.rebalance.protocols: classic,consumer,share
unstable.api.versions.enable: true
```
Member

It would be good to share the Kafka CR with these values - for example, in a simplified format, e.g.:

```
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  # ...
spec:
  kafka:
    # ...
    config:
      group.coordinator.rebalance.protocols: classic,consumer,share
      unstable.api.versions.enable: true
```

Contributor

I think it might be nice to have the instructions here more directed at the reader, e.g. deploy the quickstart, then edit your Kafka CR to add these two settings...


We can take a look at how this information is stored in the <code>__share_group_state</code> topic by using the console consumer tool:
```
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic __share_group_state | jq .
```
Member

This seems a bit disconnected, as users cannot run this with Strimzi. I wonder if it would be possible to move it to after the chapter where you have the cluster, and change the command so that users can try this themselves.

I also wonder how the share group got there -> wouldn't it show up there only after being used?
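For what it's worth, a Strimzi-friendly version of the command could look something like the following (a sketch only: the `kafka` namespace, the `my-cluster` cluster name, and the image tag are assumptions, and the values on this internal topic are binary, so expect raw output without an appropriate message formatter):

```
kubectl -n kafka run kafka-consumer -ti --rm --restart=Never \
  --image=quay.io/strimzi/kafka:0.46.0-kafka-4.0.0 \
  -- bin/kafka-console-consumer.sh \
  --bootstrap-server my-cluster-kafka-bootstrap:9092 \
  --topic __share_group_state --from-beginning
```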

Member

This screenshot (as well as the second one) seems a bit strange, as you see the closing } but not the initial one, so it is kind of starting in the middle. I wonder if we can include the start of some method or something there?

Also, the screenshots are not accessible - it might be great to have also an alternative link to the code in GitHub Gist or something like that.

Member

Probably not worth changing the screenshots now ... but ideally they should avoid execing into the broker pods and running the Java commands there, as that can cause problems such as running out of memory etc.

author: tina_selenge
---

The Kafka Queue feature introduced by [KIP-932](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka) as an early access in Kafka version 4.0 which is supported by Strimzi 0.46. In this blog, I’m going to introduce this feature and how you can try it out with your Strimzi managed cluster.
Member

It might be good to mention the early access limitations? IIRC it has also some upgrade impact which definitely should be mentioned with a warning to not use this in production?

Contributor

This feature is now in "Preview" state with a stable API, but not yet production-ready. Given that the Kafka 4.1.0 release is ongoing, it may make sense to hold this post until it is supported in a Strimzi release. Wdyt?

Member

TBH, I would not hold it - or to be clear, I do not have a problem if others want to wait with it, but I see no need for it. Just make it more clear what the state and the plan is.

Even if it is production ready in 4.1 and we wait for Strimzi 0.48, we would still need to mention it as the blog post is not directly related to any particular release. So one way or another, this will need to be covered even if we wait.


![Delivery state](/assets/images/posts/2025-06-24-kafka-queue-03.png)

In this example, offset 2 is the start offset that a share group is consuming from the partition and offset 2 and 4 are currently acquired by a consumer of the group for the first time, therefore delivery count is 1. However, the record at offset 3 is in <b>Available</b> state with delivery count of 2, which means a consumer of the group has attempted to deliver this record twice. This record will be retried until the maximum delivery count is reached. The record at offset 5 has been processed successfully and acknowledged and the record at offset 6 is the next available record to be acquired by a consumer of the share group, therefore it is the end offset for this share partition.
Member

This sentence is hard to read:

In this example, offset 2 is the start offset that a share group is consuming from the partition and offset 2 and 4 are currently acquired by a consumer of the group for the first time, therefore delivery count is 1.

```
strimzi-cluster-operator-5dd46b9985-dp2wt 1/1 Running 0 10m
```

The following screenshot of the split screen shows records being sent to a topic with 2 partitions (bottom-left terminal), while the other terminal windows show the 3 share consumers printing the records with their partition number and offset.
Member

--record-size 1024 makes the terminal output hard to read. We can use size 10, or 3, ... to avoid distracting users.

This allows users to acknowledge an individual record depending on the outcome of the processing of the record. Each record is acknowledged using a call to <code>acknowledge()</code> which takes different types of acknowledgement as an argument: ACCEPT, RELEASE and REJECT.
This aligns with the actions that can be taken by share group consumers mentioned previously.

In the next example, I have used the new consumer configuration added for this feature, <b>share.acknowledgement.mode</b> and set it to "explicit". This configuration is set to "implicit" by default, which is why I didn’t need to set this configuration for the previous example.
Member

which is why I didn’t need to set this configuration for the previous example.

But I still see you set share.acknowledgement.mode=explicit in the previous example?
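For reference, the setting in question is a single client-side property. An illustrative consumer configuration fragment (the group name is made up):

```
# share consumer configuration (illustrative)
group.id=my-share-group
share.acknowledgement.mode=explicit
```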


If the client encounters an error not caused by the JSON mapping, it releases the record for another attempt, because it could be a transient failure.

When no exception occurred during the processing, the consumer client accepts the record. Once all the records in the batch are processed and acknowledged individually, these states are stored locally in the consumer. Then the client has to call <code>commitSync()</code> or <code>commitAsync()</code> to commit the state to the internal topic, <code>__share_group_state</code>. If any of the records in the batch hits an error or skips the processing, the client commits the state at that point. The records in the batch that were not processed or acknowledged yet, will be presented to the consumer client again as part of the same acquisition, therefore their delivery count will not be incremented.
Member

The records in the batch that were not processed or acknowledged yet, will be presented to the consumer client again as part of the same acquisition, therefore their delivery count will not be incremented.

I'm not familiar with this, but from the example below, the delivery count of offset 366 is incremented, and then after the 4th retry it is archived. So it looks like they contradict each other, right?
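The per-record acknowledgement flow the blog describes can be sketched in a few lines of Python. To be clear, this is not the real API (that is Java's share consumer); the function, the record layout, and the error types here are all invented for illustration:

```python
# Illustrative sketch of the explicit-acknowledgement flow: each record in a
# fetched batch is acknowledged individually, then the states are committed.

ACCEPT, RELEASE, REJECT = "ACCEPT", "RELEASE", "REJECT"

def process_batch(records, handler):
    """Decide an acknowledgement per record, as the blog text describes."""
    acks = {}
    for record in records:
        try:
            handler(record)
            acks[record["offset"]] = ACCEPT    # processed successfully
        except ValueError:
            acks[record["offset"]] = REJECT    # e.g. a JSON mapping error: never retry
        except Exception:
            acks[record["offset"]] = RELEASE   # transient failure: make it available again
    # in the real client, commitSync()/commitAsync() would persist these states
    return acks

def handler(record):
    if record["value"] == "bad-json":
        raise ValueError("cannot map")
    if record["value"] == "flaky":
        raise TimeoutError("transient")

batch = [{"offset": o, "value": v} for o, v in
         [(10, "ok"), (11, "bad-json"), (12, "flaky")]]
print(process_batch(batch, handler))
# {10: 'ACCEPT', 11: 'REJECT', 12: 'RELEASE'}
```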


![Partition Assignments](/assets/images/posts/2025-06-24-kafka-queue-01.png)

Partitions are assigned to members of a shared group in round robin fashion while trying to maintain even balance in the assignment. Assignments in share groups are dynamic, when a consumer leaves or joins or when a partition is added, all the partitions are rebalanced across the members. As long as a consumer member continues to call the <code>poll()</code>, it stays in the group and continues to receive records from its assigned partitions. Similar to regular consumer groups, in the background, members of a shared group also send periodic heartbeats to the brokers. If a member doesn’t send a heartbeat request within the <b>group.share.session.timeout.ms</b>, it will be considered inactive and partitions will be reassigned to other members. If it is sending heartbeat requests to the broker, but it does not call the <code>poll()</code> within <b>max.poll.interval.ms</b>, then it will leave the group and the partitions will be reassigned as well.
Member

members of a shared group also send periodic heartbeats to the brokers

We could put it precisely here. Heartbeat is sent to the group coordinator.
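For context, the two timeouts discussed above sit on different sides of the protocol. The values below are, to my knowledge, the Kafka defaults; treat them as illustrative:

```
# broker / group coordinator side
group.share.session.timeout.ms: 45000     # member considered inactive without a heartbeat
group.share.heartbeat.interval.ms: 5000   # how often members send heartbeats
# client side
max.poll.interval.ms: 300000              # max gap between poll() calls before leaving the group
```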

Contributor

@fvaleri fvaleri left a comment

@tinaselenge nice job!
I left some comments for you to consider.

| Configuration | Type | Default | Description |
| :--: |:-------------| :-------------| :-------------|
| group.coordinator.rebalance.protocols | Broker | classic,consumer | It should be set to "classic,consumer,share" to enable share groups. |
| unstable.api.versions.enable | Broker | false | It should be set to true in order to use this feature until it is production ready. |
Contributor

In Kafka 4.1.0 (currently being released), share groups evolved from "early access" to "preview", so the unstable.api.versions.enable internal configuration is not needed anymore. When formatting without feature flags, you also have to manually upgrade the share.version feature from 0 to 1.
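The manual feature upgrade mentioned here would presumably use the stock kafka-features.sh tool, something like the following (flag names as I recall them from the Kafka 4.x CLI; worth verifying against the version in use):

```
./bin/kafka-features.sh --bootstrap-server localhost:9092 upgrade --feature share.version=1
```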


The Kafka Queue feature introduced by [KIP-932](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka) as an early access in Kafka version 4.0 which is supported by Strimzi 0.46. In this blog, I’m going to introduce this feature and how you can try it out with your Strimzi managed cluster.

This feature is only supported with KRaft clusters since Zookeeper was removed in the Kafka 4.0 release. It is also based on the new consumer rebalance protocol introduced by [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) that enhances stability, scalability and simplifies client implementations.
Contributor

The client API didn't change, so I don't see how it simplifies the implementation. Maybe you wanted to say it improves stability and performance, or something along these lines.

Author

The KIP moved partition assignment from the client to the server, so that's what I meant by simplifying the client, but this line is actually not clear, so I will just change it as you suggested to something more generic.


The Kafka Queue feature allows you to use Kafka like a traditional message queue and increase parallelism of message processing beyond the number of partitions. It provides queue-like semantics through a new consumer group called <b>shared group</b>. This new type of group gives more fine-grained control on message acknowledgement and retries.

The key difference between share group and regular consumer group is how partitions get assigned to consumer members. With regular consumer groups, each partition is exclusively assigned to a single member of the consumer group, therefore users typically have as many consumer members as the number of partitions to maximise the parallelism in message processing. However, share groups balance partitions between all members of a share group, allowing multiple consumer members to fetch from the same partition. So users can have more consumers than the number of partitions, further increasing the parallelism.
Contributor

Do we also want to mention that one downside of regular consumer groups is that you have to over-provision the number of topic partitions in order to cope with peak loads that may only happen during a limited period of time?


![Partition Assignments](/assets/images/posts/2025-06-24-kafka-queue-01.png)

Partitions are assigned to members of a shared group in round robin fashion while trying to maintain even balance in the assignment. Assignments in share groups are dynamic, when a consumer leaves or joins or when a partition is added, all the partitions are rebalanced across the members. As long as a consumer member continues to call the <code>poll()</code>, it stays in the group and continues to receive records from its assigned partitions. Similar to regular consumer groups, in the background, members of a shared group also send periodic heartbeats to the brokers. If a member doesn’t send a heartbeat request within the <b>group.share.session.timeout.ms</b>, it will be considered inactive and partitions will be reassigned to other members. If it is sending heartbeat requests to the broker, but it does not call the <code>poll()</code> within <b>max.poll.interval.ms</b>, then it will leave the group and the partitions will be reassigned as well.
Contributor

On the assignment and record delivery side, we could add that there is no guarantee about ordering, nor that a consumer will keep getting records from the same partition.


When a consumer in a share group fetches records, it acquires the records with a time-limited acquisition lock. While a record is acquired, it’s not available for other consumers. The lock is automatically released once the lock duration has elapsed and the record becomes available again for another delivery attempt. This ensures delivery progresses even when a consumer fails to process the record. The lock duration can be configured with the broker configuration, <b>group.share.record.lock.duration.ms</b>, which is set to 30s by default.

The number of records acquired for a partition by consumers in a shared group is also limited. Once this limit is reached, fetching of records from the shared group temporarily pauses until the number of acquired records reduces. The limit can be configured with the broker configuration, <b>group.share.partition.max.record.locks</b>, which is set to 200 by default.
Contributor

This limit has been raised to 2000 in Kafka 4.1.0.
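In a Strimzi Kafka CR, these two knobs would sit under `spec.kafka.config`, something like the following sketch (values are the defaults quoted in the text):

```
spec:
  kafka:
    config:
      group.share.record.lock.duration.ms: 30000     # acquisition lock duration, 30s default
      group.share.partition.max.record.locks: 200    # default in Kafka 4.0 (2000 in 4.1)
```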

The delivery count determines which state the record should transition into later on. If the consumer releases the record or the lock duration has elapsed, it will go back into <b>Available</b> state as long as the delivery count has not reached the limit.
The record will go into <b>Acknowledged</b> state if it has been successfully processed by the consumer. If the consumer rejects the record or its delivery count has reached the limit, the record will go into <b>Archived</b> state.

In a traditional message queue, once a record is fetched by a consumer, it is no longer in the queue. You can think of the <b>Archived</b> state as marking records that are taken off the queue and no longer available.
Contributor

In future work (KIP-932), they mention the dead-letter queue enhancement. It may be worth noting here.

Author

I mentioned it in the API section, but maybe I will pull it here with more information comparing a traditional queue and Kafka, as Kate suggested.
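The state transitions described in the text can be sketched as a small state machine. This is illustrative only, not Kafka code: the class and method names are invented, and the delivery-count limit of 5 is an assumption based on KIP-932's default for group.share.delivery.count.limit:

```python
# Illustrative sketch of the share-partition record states:
# Available -> Acquired -> (Available | Acknowledged | Archived)

AVAILABLE, ACQUIRED, ACKNOWLEDGED, ARCHIVED = (
    "Available", "Acquired", "Acknowledged", "Archived")

class ShareRecord:
    def __init__(self, offset, delivery_count_limit=5):
        self.offset = offset
        self.state = AVAILABLE
        self.delivery_count = 0
        self.limit = delivery_count_limit

    def acquire(self):
        # A consumer fetches the record under a time-limited acquisition lock.
        assert self.state == AVAILABLE
        self.state = ACQUIRED
        self.delivery_count += 1

    def release(self):
        # RELEASE (or lock timeout): back to Available unless the limit is hit.
        assert self.state == ACQUIRED
        self.state = AVAILABLE if self.delivery_count < self.limit else ARCHIVED

    def accept(self):
        # ACCEPT: successfully processed and acknowledged.
        assert self.state == ACQUIRED
        self.state = ACKNOWLEDGED

    def reject(self):
        # REJECT: permanently taken off the queue.
        assert self.state == ACQUIRED
        self.state = ARCHIVED

# Mirrors the blog's example: offset 3 was delivered twice and is Available again.
record = ShareRecord(offset=3)
record.acquire(); record.release()
record.acquire(); record.release()
print(record.state, record.delivery_count)  # Available 2
```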

I also added the following 2 configurations to my Kafka CR to enable this feature, as it is not enabled by default since it is not production ready yet:
```
group.coordinator.rebalance.protocols: classic,consumer,share
unstable.api.versions.enable: true
```
Contributor

Not needed in Kafka 4.1.0.

Contributor

@katheris katheris left a comment

I've added a few different comments about wording and things.

In terms of the overall structure. I wonder whether it might be nice to have the steps showing how to use the share group and comparing to a regular consumer group earlier, so e.g.

  • High level overview of queues in Kafka
  • Example of how to use both consumer group types and how they work differently
  • More in-depth discussion about how it works (the sections you have on partition assignment, fetch mechanism, delivery state)
  • Example of using the API

It might help the reader to understand better the difference before diving into some of the more technical details

@tinaselenge @scholzj @showuon what do you think?

@@ -0,0 +1,416 @@
---
layout: post
title: "Using Kafka Queue feature with Strimzi"
Contributor

Do we know what the official feature name is? I noticed in the KIP it's referred to as "Queues for Kafka", so should the title and other references be "Using Queues for Kafka with Strimzi"?

@showuon do you have any insights on this?

Contributor

The feature name is "Share Groups", but Queues for Kafka is nice for marketing :)

Member

I think the queue reference is what is used most often. So we should stick with it as that is likely to be what users will be looking for. Queues for Kafka is probably what I saw most often.


The Kafka Queue feature introduced by [KIP-932](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka) as an early access in Kafka version 4.0 which is supported by Strimzi 0.46. In this blog, I’m going to introduce this feature and how you can try it out with your Strimzi managed cluster.

This feature is only supported with KRaft clusters since Zookeeper was removed in the Kafka 4.0 release. It is also based on the new consumer rebalance protocol introduced by [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) that enhances stability, scalability and simplifies client implementations.
Contributor

Rewording suggestion to fit with how we describe KRaft mode in the docs.

Also should we explain or point to docs for running in KRaft mode and enabling the new consumer rebalance protocol in Strimzi?

Suggested change
This feature is only supported with KRaft clusters since Zookeeper was removed in the Kafka 4.0 release. It is also based on the new consumer rebalance protocol introduced by [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) that enhances stability, scalability and simplifies client implementations.
This feature is only supported with Kafka clusters running in KRaft mode, since Zookeeper was removed in the Kafka 4.0 release. It is also based on the new consumer rebalance protocol introduced by [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) that enhances stability, scalability and simplifies client implementations.


This feature is only supported with KRaft clusters since Zookeeper was removed in the Kafka 4.0 release. It is also based on the new consumer rebalance protocol introduced by [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) that enhances stability, scalability and simplifies client implementations.

The Kafka Queue feature allows you to use Kafka like a traditional message queue and increase parallelism of message processing beyond the number of partitions. It provides queue-like semantics through a new consumer group called <b>shared group</b>. This new type of group gives more fine-grained control on message acknowledgement and retries.
Contributor

Do we need a sentence explaining the difference between a traditional message queue and Kafka?


This feature is only supported with KRaft clusters since Zookeeper was removed in the Kafka 4.0 release. It is also based on the new consumer rebalance protocol introduced by [KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) that enhances stability, scalability and simplifies client implementations.

The Kafka Queue feature allows you to use Kafka like a traditional message queue and increase parallelism of message processing beyond the number of partitions. It provides queue-like semantics through a new consumer group called <b>shared group</b>. This new type of group gives more fine-grained control on message acknowledgement and retries.
Contributor

I think it's share group without the "d", right? Worth doing a check of the rest of the blog, as we have a mixture of share group and shared group.

Suggested change
The Kafka Queue feature allows you to use Kafka like a traditional message queue and increase parallelism of message processing beyond the number of partitions. It provides queue-like semantics through a new consumer group called <b>shared group</b>. This new type of group gives more fine-grained control on message acknowledgement and retries.
The Kafka Queue feature allows you to use Kafka like a traditional message queue and increase parallelism of message processing beyond the number of partitions. It provides queue-like semantics through a new consumer group called <b>share group</b>. This new type of group gives more fine-grained control on message acknowledgement and retries.

author: tina_selenge
---

The Kafka Queue feature introduced by [KIP-932](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka) as an early access in Kafka version 4.0 which is supported by Strimzi 0.46. In this blog, I’m going to introduce this feature and how you can try it out with your Strimzi managed cluster.
Contributor

In line with my other comment, and also adding the word "Apache" here to be clear we're are talking about upstream:

Suggested change
The Kafka Queue feature introduced by [KIP-932](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka) as an early access in Kafka version 4.0 which is supported by Strimzi 0.46. In this blog, I’m going to introduce this feature and how you can try it out with your Strimzi managed cluster.
Queues for Kafka ([KIP-932](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka#KIP932:QueuesforKafka)) is introduced as early access in Apache Kafka version 4.0, which is supported by Strimzi 0.46. In this blog, I’m going to introduce this feature and how you can try it out with your Strimzi managed cluster.


The Kafka Queue feature allows you to use Kafka like a traditional message queue and increase parallelism of message processing beyond the number of partitions. It provides queue-like semantics through a new consumer group called <b>shared group</b>. This new type of group gives more fine-grained control on message acknowledgement and retries.

The key difference between share group and regular consumer group is how partitions get assigned to consumer members. With regular consumer groups, each partition is exclusively assigned to a single member of the consumer group, therefore users typically have as many consumer members as the number of partitions to maximise the parallelism in message processing. However, share groups balance partitions between all members of a share group, allowing multiple consumer members to fetch from the same partition. So users can have more consumers than the number of partitions, further increasing the parallelism.
Contributor

Should we expand here on the fact that although multiple consumers are assigned to the same partition, each record on that partition is still only read by one consumer in the group?


When no exception occurred during the processing, the consumer client accepts the record. Once all the records in the batch are processed and acknowledged individually, these states are stored locally in the consumer. Then the client has to call <code>commitSync()</code> or <code>commitAsync()</code> to commit the state to the internal topic, <code>__share_group_state</code>. If any of the records in the batch hits an error or skips the processing, the client commits the state at that point. The records in the batch that were not processed or acknowledged yet, will be presented to the consumer client again as part of the same acquisition, therefore their delivery count will not be incremented.

<i>Inspecting __share_group_state topic</i>
Contributor

Should this be a heading? Looking at the preview it's getting a little lost

@showuon
Member

showuon commented Jul 25, 2025

I've added a few different comments about wording and things.

In terms of the overall structure. I wonder whether it might be nice to have the steps showing how to use the share group and comparing to a regular consumer group earlier, so e.g.

* High level overview of queues in Kafka

* Example of how to use both consumer group types and how they work differently

* More in-depth discussion about how it works (the sections you have on partition assignment, fetch mechanism, delivery state)

* Example of using the API

It might help the reader to understand better the difference before diving into some of the more technical details

@tinaselenge @scholzj @showuon what do you think?

I like this idea. Thanks Kate!
