Open
Labels
enhancement, help wanted
Description
The idea came up while writing a blog post on the topic:
https://thriving.dev/blog/interactive-queries-with-kafka-streams-cassandra-state-store-part-2
Context
- The behaviour for large state stores has yet to be tested.
- Cassandra fetches data in chunks, so reading / iterating over a large number of rows should still be possible as long as timeouts are not breached.
- For queries that can return more than one result (`all`, `range`, `prefixScan`, `query`), the current implementation executes the CQL query for all partitions in parallel, then iterates over the result iterators one by one in sequential order.
  - This should be correct from a consistency point of view, but for large states, would the Cassandra `ResultSet` iterators that are kept on hold time out before being consumed?
  - While results are fetched in chunks, with the 'parallel query' pattern the first chunks for all partitions would be fetched at once and demand RAM.
  - Maybe it would be better to switch to a sequential query+processing pattern that iterates over the partitions (see `org.apache.kafka.streams.state.internals.CompositeKeyValueIterator`), or to provide both and let the user choose which one to use.
  - It would be interesting to test and compare both options for different use cases and streams architectures.
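To make the sequential query+processing idea more concrete, here is a minimal sketch in plain Java. It is not the library's implementation: `LazyPartitionIterator` is a hypothetical name, and each `Supplier` merely stands in for "execute the CQL query for one partition and return its result iterator". The point is only that a partition's query is not started until the previous partition's iterator has been drained.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of the library): chains per-partition
 * iterators, opening each one only when the previous one is exhausted.
 */
final class LazyPartitionIterator<T> implements Iterator<T> {
    // One supplier per partition; calling get() stands in for running the query.
    private final Iterator<Supplier<Iterator<T>>> partitionQueries;
    private Iterator<T> current = null;

    LazyPartitionIterator(List<Supplier<Iterator<T>>> partitionQueries) {
        this.partitionQueries = partitionQueries.iterator();
    }

    @Override
    public boolean hasNext() {
        // Move on to the next partition only once the current one is drained,
        // so at most one query (and its fetched chunk) is open at a time.
        while ((current == null || !current.hasNext()) && partitionQueries.hasNext()) {
            current = partitionQueries.next().get();
        }
        return current != null && current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

Compared with the parallel pattern, this trades latency (queries run one after another) for a bounded number of open result sets, which is the trade-off that would need to be measured for large states.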
Contributions are welcome! Please leave a comment or PM me on Twitter.