Refreshing the topology is a resource-intensive operation, especially in large clusters #3331

@Paragrf

Description

Feature Request

Is your feature request related to a problem? Please describe

We discovered that enabling DynamicRefreshSources in large clusters causes significant performance overhead during topology refreshes. Testing in a cluster with 200 nodes showed that the topology refresh process alone consumes 1–2 CPU cores. Profiling the refresh with a flame graph showed that updateCache accounts for 75% of the total cost at 200 nodes, and this proportion grows as the cluster scales, reaching up to 95% at 1,000 nodes.

Describe the solution you'd like

Upon analysis, we found that the time complexity of the updateCache operation is O(N² * 16384) during topology refresh. This step is mainly responsible for transforming the topology from a node → slot perspective into a slot → node perspective, which is used by Lettuce for routing read traffic. However, this transformed view is not required for the KnownMajority-based topology selection logic.
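To make the cost concrete, here is a minimal sketch of what a slot → node cache rebuild looks like. The Node and Partitions classes below are hypothetical stand-ins for Lettuce's actual topology model, not its real API; the point is only the loop structure that drives the complexity.

```java
import java.util.List;

// Hypothetical stand-in for a cluster node and its owned slots.
class Node {
    final String id;
    final List<Integer> slots; // slots owned by this node

    Node(String id, List<Integer> slots) {
        this.id = id;
        this.slots = slots;
    }
}

class Partitions {
    static final int SLOT_COUNT = 16384;

    // Rebuilds the slot -> node lookup table from the node -> slot view.
    // A single rebuild touches up to N nodes * 16384 slots; with
    // DynamicRefreshSources, running it once per each of the N collected
    // topology views yields the O(N^2 * 16384) cost observed above.
    static Node[] updateCache(List<Node> nodes) {
        Node[] slotCache = new Node[SLOT_COUNT];
        for (Node node : nodes) {
            for (int slot : node.slots) {
                slotCache[slot] = node;
            }
        }
        return slotCache;
    }
}
```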

Therefore, we propose an optimization: first perform the topology selection, and then run updateCache only on the selected optimal view. This would reduce the time complexity of updateCache from O(N² * 16384) to a constant O(16384), significantly reducing the performance overhead of the topology refresh process.
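The proposed ordering can be sketched as follows. This is an illustration under assumptions, not Lettuce's implementation: each queried node's view is modeled as a slot → node-id map, and the "most slots covered" criterion merely stands in for the real KnownMajority selection logic.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class RefreshSketch {
    static final int SLOT_COUNT = 16384;

    // views: one topology view per queried node (slot -> owning node id).
    static String[] refresh(List<Map<Integer, String>> views) {
        // 1. Topology selection runs first, on the raw node -> slot
        //    views. (Placeholder criterion; KnownMajority in Lettuce.)
        Map<Integer, String> best = views.stream()
                .max(Comparator.comparingInt(Map::size))
                .orElseThrow();

        // 2. updateCache then runs exactly once, on the selected view
        //    only: O(16384) instead of O(N^2 * 16384).
        String[] slotCache = new String[SLOT_COUNT];
        best.forEach((slot, nodeId) -> slotCache[slot] = nodeId);
        return slotCache;
    }
}
```

Since the slot → node view is only needed for routing read traffic, deferring its construction until after selection loses nothing for the selection step itself.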

Flame graphs at 200 nodes

  • Flame graph source file
    profile.tar.gz

  • Vanilla (updateCache is the purple part in the picture)
    Image

  • After optimization
    Image

Furthermore

GC pressure during large-cluster topology refreshes is still a problem; memory pools could be considered in the future.
