try: sequential query+processing pattern iterating over the partitions (CassandraPartitionedReadOnlyKeyValueStore) #26

@hartmut-co-uk

Idea came up while writing a blog post on the topic
-> https://thriving.dev/blog/interactive-queries-with-kafka-streams-cassandra-state-store-part-2

Context

  • The behaviour for large state stores has yet to be tested.
  • Cassandra fetches data in chunks (pages), so iterating over a large number of rows should still be possible as long as timeouts are not breached.
  • The current implementation of the multi-result queries (all, range, prefixScan, query) executes the CQL query for all partitions in parallel, then consumes the result iterators one by one in sequential order.
    • This should be correct from a consistency point of view, but for large states, could the Cassandra ResultSet iterators that are waiting their turn time out before they are consumed?
    • Although results are fetched in chunks, the 'parallel query' pattern fetches the first chunk of every partition at once, and all of those chunks have to be held in RAM.
    • It might be better to switch to a sequential query+processing pattern that iterates over the partitions (see org.apache.kafka.streams.state.internals.CompositeKeyValueIterator), or to provide both and let the user choose which one to use.
    • It would be interesting to test and compare both options across different use cases and streams architectures.
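
To make the sequential idea concrete, here is a minimal sketch in plain Java with no Cassandra or Kafka Streams dependency. It is not the store's actual code: `SequentialPartitionIterator` and the `queryPartition` callback are hypothetical names, and the callback merely stands in for "run the CQL query for one partition and return its result iterator". The query for the next partition is only issued once the previous partition's results are exhausted, so at most one result iterator is open at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch: lazily chains per-partition result iterators.
 * The query for a partition is only issued when the iterator reaches it.
 */
final class SequentialPartitionIterator<T> implements Iterator<T> {
    private final Iterator<Integer> partitions;
    // Assumption: stands in for "execute the CQL query for one partition".
    private final IntFunction<Iterator<T>> queryPartition;
    private Iterator<T> current = Collections.emptyIterator();

    SequentialPartitionIterator(List<Integer> partitions,
                                IntFunction<Iterator<T>> queryPartition) {
        this.partitions = partitions.iterator();
        this.queryPartition = queryPartition;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next partition only when the current one is drained;
        // empty partitions are skipped.
        while (!current.hasNext() && partitions.hasNext()) {
            current = queryPartition.apply(partitions.next());
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        List<Integer> queried = new ArrayList<>();
        Iterator<String> it = new SequentialPartitionIterator<>(
                List.of(0, 1, 2),
                p -> {
                    queried.add(p); // record when each "query" is issued
                    return List.of("key-" + p + "-a", "key-" + p + "-b").iterator();
                });
        while (it.hasNext()) {
            System.out.println(it.next() + " (partitions queried so far: " + queried + ")");
        }
    }
}
```

The 'parallel query' pattern would instead call `queryPartition` for every partition up front and then chain the already-open iterators. A real implementation would also need to close each underlying iterator, which this sketch leaves out.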

Contributions are welcome. Please leave a comment or PM me on Twitter!


Labels: enhancement, help wanted
