Description
Feature Request
Is your feature request related to a problem? Please describe
We discovered that enabling DynamicRefreshSources in large clusters causes significant performance overhead during topology refreshes. Testing in a cluster with 200 nodes showed that the topology refresh process alone consumes 1–2 CPU cores. We profiled the topology refresh with a flame graph and found that updateCache accounts for 75% of the total cost at 200 nodes, and this proportion continues to grow as the cluster scales, reaching up to 95% at 1,000 nodes.
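For reference, this is the configuration under which the overhead appears. A minimal sketch using Lettuce's public API (the host, port, and refresh period below are placeholders):

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class DynamicRefreshSourcesExample {

    public static void main(String[] args) {
        // With dynamicRefreshSources(true), every known node is queried
        // during a topology refresh, so N nodes yield N candidate views.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(60))
                .dynamicRefreshSources(true)
                .build();

        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis-host", 6379));
        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());
    }
}
```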
Describe the solution you'd like
Upon analysis, we found that the time complexity of the updateCache operation during topology refresh is O(N² × 16384), where N is the number of nodes and 16384 is the number of Redis Cluster hash slots. This step transforms the topology from a node → slot view into a slot → node view, which Lettuce uses to route read traffic. However, this transformed view is not required by the KnownMajority-based topology selection logic.
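To illustrate where the O(N² × 16384) comes from, here is a simplified sketch of the transformation. It is not Lettuce's actual Partitions.updateCache implementation; SlotCacheDemo, NodeView, and SLOT_COUNT are hypothetical names for this illustration:

```java
import java.util.List;

class SlotCacheDemo {

    static final int SLOT_COUNT = 16384; // Redis Cluster hash slots

    // A node together with the set of slots it claims to own.
    record NodeView(String nodeId, boolean[] ownedSlots) {
        boolean hasSlot(int slot) {
            return ownedSlots[slot];
        }
    }

    // Builds the slot -> node routing table for ONE topology view:
    // each of the 16384 slots scans the node list for its owner,
    // so a single view costs O(N * 16384).
    static String[] updateCache(List<NodeView> view) {
        String[] slotToNode = new String[SLOT_COUNT];
        for (int slot = 0; slot < SLOT_COUNT; slot++) {
            for (NodeView node : view) {
                if (node.hasSlot(slot)) {
                    slotToNode[slot] = node.nodeId();
                    break;
                }
            }
        }
        return slotToNode;
    }

    // With dynamic refresh sources, each of the N nodes reports its own
    // view, and the cache is rebuilt for every candidate view before
    // any selection happens: N views * O(N * 16384) = O(N^2 * 16384).
    static void refreshAllViews(List<List<NodeView>> candidateViews) {
        for (List<NodeView> view : candidateViews) {
            updateCache(view);
        }
    }
}
```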
Therefore, we propose an optimization: perform the topology selection first, and then run updateCache only on the selected optimal view. This would reduce the time complexity of updateCache from O(N² × 16384) to O(16384), which is constant with respect to cluster size, significantly reducing the performance overhead of the topology refresh process.
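A sketch of the proposed ordering, reusing the hypothetical types from the previous snippet (selectKnownMajority is a rough stand-in for the KnownMajority selection, which operates on the node → slot view and needs no slot cache):

```java
import java.util.Comparator;
import java.util.List;

class ProposedRefreshDemo {

    // Stand-in for KnownMajority selection: pick the candidate view that
    // knows the largest number of nodes. No slot -> node cache needed.
    static List<SlotCacheDemo.NodeView> selectKnownMajority(
            List<List<SlotCacheDemo.NodeView>> candidateViews) {
        return candidateViews.stream()
                .max(Comparator.comparingInt(List::size))
                .orElseThrow();
    }

    // Proposed ordering: select the winning view first, then build the
    // slot -> node cache exactly once, for that view only.
    static String[] refresh(List<List<SlotCacheDemo.NodeView>> candidateViews) {
        List<SlotCacheDemo.NodeView> best = selectKnownMajority(candidateViews);
        return SlotCacheDemo.updateCache(best);
    }
}
```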
Flame graph from the 200-node cluster:
[flame graph image]
Flame graph source file: profile.tar.gz
Furthermore
GC pressure during large-cluster topology refreshes remains a problem; memory pooling could be considered in the future.