Skip to content

Conversation

@SirTyson
Copy link
Contributor

@SirTyson SirTyson commented Aug 4, 2025

Description

This adds an experimental flag that when set, uses the closeTime from the last externalized SCP message as the basis for setting the triggerNextLedger timer.

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

uint64_t nextIndex = nextConsensusLedgerIndex();
auto lastIndex = trackingConsensusLedgerIndex();

// Note, this is not the LCL externalized close time, but the close time of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hard time reasoning about this -- why is lcl-1 more trustworthy than lcl?

I think that if you have concerns that a bad validator picks a bad value for close time, this should be handled in the validity checks not after externalizing.

Then this code only needs to be concerned about the impact of latency in both SCP and apply.

By the time you externalize, you already went through both consensus and apply, so now (assuming no clock drift) is:

  • under "low network load", close to lcl.ct+1 (with SCP overhead ~ 1s)
  • under "max capacity", close to lcl.ct+1+4 (SCP overhead and apply time combined are 5 seconds)
  • over "max capacity", larger than lcl.ct+5

I think that if you look at lcl-1, you end up with values that are off by 5 across the board -- I suspect lastLedgerStatingPoint = std::max(now - timeSinceExternalized, lastLedgerStatingPoint); is what makes this code recover? (some tests should allow you to confirm/deny this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misunderstood what the closeTime in the externalized message represented, this had nothing to do with being more trustworthy. I didn't realize closeTime for block N was the time at which SCP started, but thought it was the time that the message was broadcast at the end of SCP. Using N -1 assumed the time was the end of N - 1, not the beginning. I've pushed an update accordingly.

I had not run a supercluster test after adding the max clamp, but interestingly, the runs with the erroneous approach were still pretty good, and still saw a boost in TPS and fairly stable block times around 5 seconds (accept at times of heavy load). I'm not sure why it worked like that, and I doubt it was safe, but I'll dig into it a big more.

@SirTyson
Copy link
Contributor Author

SirTyson commented Sep 30, 2025

I've ran some tests with the fixed version, and tests look as expected. I ran tests on a 100 node networks (unfortunately I've been having issues with supercluster with larger networks) at 600 TPS to estimate a load and topology somewhat similar to main net. I also did a test on a smaller 7 node network at high TPS (2350). Here are the results:

100 node network (generated-overlay-topology-2) 650 TPS

current timer

Full run metrics

image image image image

new timer logic

Full run metrics

image image image image

Overall, the new timer logic has significantly lower and more consistent block times. There were modest spikes up to 6 seconds, and 3 / 100 nodes showed a ledger age above 5 seconds. Contrasted with the older timer, nodes saw ledger age spikes up to 10 seconds and about 60% of the network had an average ledger age > 5 seconds.

p75 nomination time and nomination timeouts are slightly better on the new timer logic, but are mostly unchanged. Notably the p99 nomination time and nomination time spikes are significantly reduced on the new timer logic, indicating better network stability.

@SirTyson
Copy link
Contributor Author

7 node network (max TPS topology) 2350 TPS

Current Timer

Full run metrics

image image image image

New Timer Logic

Full run metrics

image image image image

Ledger age is significantly improved and more consistent. Nomination times are also significantly improved, with 1 less nomination timeout consistently observed and better, more consistent nomination overall.

However, ledger age is still higher than we would expect under load. With this timer change, we'd expect a fairly consistent 5 second ledger, but all nodes average much higher than 5 second ledgers. While the timing is much improved, it's still not what we intend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants