Experimental triggerNextLedger timer Change #4865

SirTyson · 2025-08-04T20:11:48Z

Description

This adds an experimental flag that when set, uses the closeTime from the last externalized SCP message as the basis for setting the triggerNextLedger timer.

Checklist

Reviewed the contributing document
Rebased on top of master (no merge commits)
Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
Compiles
Ran all tests
If change impacts performance, include supporting evidence per the performance document

MonsieurNicolas · 2025-08-26T18:31:42Z

src/herder/HerderImpl.cpp

    uint64_t nextIndex = nextConsensusLedgerIndex();
    auto lastIndex = trackingConsensusLedgerIndex();

+    // Note, this is not the LCL externalized close time, but the close time of


I have a hard time reasoning about this -- why is lcl-1 more trustworthy than lcl?

I think that if you have concerns that a bad validator picks a bad value for close time, this should be handled in the validity checks not after externalizing.

Then this code only needs to be concerned about the impact of latency in both SCP and apply.

By the time you externalize, you already went through both consensus and apply, so now (assuming no clock drift) is:

under "low network load", close to lcl.ct+1 (with SCP overhead ~ 1s)

under "max capacity", close to lcl.ct+1+4 (SCP overhead and apply time combined are 5 seconds)

over "max capacity", larger than lcl.ct+5

I think that if you look at lcl-1, you end up with values that are off by 5 across the board -- I suspect lastLedgerStatingPoint = std::max(now - timeSinceExternalized, lastLedgerStatingPoint); is what makes this code recover? (some tests should allow you to confirm/deny this)

I think I misunderstood what the closeTime in the externalized message represented, this had nothing to do with being more trustworthy. I didn't realize closeTime for block N was the time at which SCP started, but thought it was the time that the message was broadcast at the end of SCP. Using N -1 assumed the time was the end of N - 1, not the beginning. I've pushed an update accordingly.

I had not run a supercluster test after adding the max clamp, but interestingly, the runs with the erroneous approach were still pretty good, and still saw a boost in TPS and fairly stable block times around 5 seconds (accept at times of heavy load). I'm not sure why it worked like that, and I doubt it was safe, but I'll dig into it a big more.

SirTyson · 2025-09-30T18:38:57Z

I've ran some tests with the fixed version, and tests look as expected. I ran tests on a 100 node networks (unfortunately I've been having issues with supercluster with larger networks) at 600 TPS to estimate a load and topology somewhat similar to main net. I also did a test on a smaller 7 node network at high TPS (2350). Here are the results:

100 node network (generated-overlay-topology-2) 650 TPS

current timer

Full run metrics

new timer logic

Full run metrics

Overall, the new timer logic has significantly lower and more consistent block times. There were modest spikes up to 6 seconds, and 3 / 100 nodes showed a ledger age above 5 seconds. Contrasted with the older timer, nodes saw ledger age spikes up to 10 seconds and about 60% of the network had an average ledger age > 5 seconds.

p75 nomination time and nomination timeouts are slightly better on the new timer logic, but are mostly unchanged. Notably the p99 nomination time and nomination time spikes are significantly reduced on the new timer logic, indicating better network stability.

SirTyson · 2025-09-30T20:58:47Z

7 node network (max TPS topology) 2350 TPS

Current Timer

Full run metrics

New Timer Logic

Full run metrics

Ledger age is significantly improved and more consistent. Nomination times are also significantly improved, with 1 less nomination timeout consistently observed and better, more consistent nomination overall.

However, ledger age is still higher than we would expect under load. With this timer change, we'd expect a fairly consistent 5 second ledger, but all nodes average much higher than 5 second ledgers. While the timing is much improved, it's still not what we intend.

SirTyson force-pushed the externalize-clock branch from 6af4b3f to 99cf3aa Compare August 5, 2025 18:15

SirTyson force-pushed the externalize-clock branch from 99cf3aa to 24db760 Compare August 25, 2025 20:00

MonsieurNicolas reviewed Aug 26, 2025

View reviewed changes

Added EXPERIMENTAL_TRIGGER_TIMER

52b7601

SirTyson force-pushed the externalize-clock branch from 24db760 to 52b7601 Compare August 28, 2025 22:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Experimental triggerNextLedger timer Change #4865

Experimental triggerNextLedger timer Change #4865

Uh oh!

SirTyson commented Aug 4, 2025

Uh oh!

MonsieurNicolas Aug 26, 2025

Uh oh!

SirTyson Aug 28, 2025

Uh oh!

SirTyson commented Sep 30, 2025 •

edited

Loading

Uh oh!

SirTyson commented Sep 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Experimental triggerNextLedger timer Change #4865

Are you sure you want to change the base?

Experimental triggerNextLedger timer Change #4865

Uh oh!

Conversation

SirTyson commented Aug 4, 2025

Description

Checklist

Uh oh!

MonsieurNicolas Aug 26, 2025

Choose a reason for hiding this comment

Uh oh!

SirTyson Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

SirTyson commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

100 node network (generated-overlay-topology-2) 650 TPS

current timer

new timer logic

Uh oh!

SirTyson commented Sep 30, 2025

7 node network (max TPS topology) 2350 TPS

Current Timer

New Timer Logic

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SirTyson commented Sep 30, 2025 •

edited

Loading