Conversation

Contributor

@Nana-EC Nana-EC commented Nov 11, 2025

Reviewer Notes

Provide suggested metric alerting values so operators can get started quickly on observing their BN.

Related Issue(s)

Fixes #1615

@Nana-EC Nana-EC added this to the 0.23.0 milestone Nov 11, 2025
@Nana-EC Nana-EC self-assigned this Nov 11, 2025
@Nana-EC Nana-EC added Block Node Issues/PR related to the Block Node. Documentation Issues/PR related to documentation Metrics Issues related to Metrics labels Nov 11, 2025
@AlfredoG87 AlfredoG87 modified the milestones: 0.23.0, 0.24.0 Nov 14, 2025
@Nana-EC Nana-EC force-pushed the 1615-metric-alert-recommendations branch from 7409400 to 25f6041 Compare November 14, 2025 19:42
@Nana-EC Nana-EC marked this pull request as ready for review November 17, 2025 15:31
@Nana-EC Nana-EC requested review from a team as code owners November 17, 2025 15:31

codecov bot commented Nov 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main    #1851      +/-   ##
============================================
+ Coverage     80.57%   80.58%   +0.01%     
+ Complexity     1178     1177       -1     
============================================
  Files           127      127              
  Lines          5550     5553       +3     
  Branches        591      591              
============================================
+ Hits           4472     4475       +3     
- Misses          805      806       +1     
+ Partials        273      272       -1     

see 3 files with indirect coverage changes


Contributor

@mustafauzunn mustafauzunn left a comment

The conventional-pr-title check is not passing. Besides that, it's looking good.

Contributor

@jsync-swirlds jsync-swirlds left a comment

Just some optional suggestions for alert conditions.

Comment on lines +271 to +272
| L | `messaging_item_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
Contributor

These are a bit too tight; we'd actually prefer to see utilization higher than 60% under load; over 80% might be worth alerting, depending on how large the queues are.

Contributor

@AlfredoG87

I think that 60% utilization makes sense here, especially since it is an L severity.

Contributor

I'm more concerned that under moderate load we'd be alerting when the node is at its most healthy, which seems backwards.
Utilization between 40% and 80% should be "green"; alert only outside that range (on either end).
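
As a concrete illustration of that band, here is a minimal sketch that checks the two queue metrics from outside the node, assuming they are scraped by a Prometheus server. The `http://localhost:9090` endpoint and the exact band values are placeholder assumptions, not part of this PR.

```python
# Sketch: flag messaging queue utilization outside a 40%-80% "green" band.
# Assumes a Prometheus server at PROM_URL is scraping the Block Node metrics;
# the endpoint, severity label, and band values are illustrative only.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed Prometheus endpoint, adjust as needed
QUEUE_METRICS = [
    "messaging_item_queue_percent_used",
    "messaging_notification_queue_percent_used",
]
LOW, HIGH = 40.0, 80.0  # the "green" band discussed above


def query_instant(metric: str) -> list:
    """Run a Prometheus instant query and return (labels, value) pairs."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    return [(r["metric"], float(r["value"][1])) for r in body["data"]["result"]]


def check_band() -> None:
    """Print an alert line for any series outside the LOW-HIGH band."""
    for metric in QUEUE_METRICS:
        for labels, value in query_instant(metric):
            if value < LOW or value > HIGH:
                print(f"ALERT (L): {metric} {labels} = {value:.1f}% outside {LOW:.0f}-{HIGH:.0f}%")
            else:
                print(f"OK: {metric} {labels} = {value:.1f}%")


if __name__ == "__main__":
    check_band()
```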

| Severity | Metric | Alert Condition |
|----------|--------------------------------|-----------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
| M | `publisher_receive_latency_ns` | If value exceeds 5s |
Contributor

This is repeated below with a different value.

Comment on lines +278 to +281
| M | `publisher_receive_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
| M | `hashing_block_time` | If value exceeds 2s, otherwise, configure as needed |
| M | `verification_block_time` | If value exceeds 20s, otherwise, configure as needed |
| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
Contributor

These are almost certainly far too long.
In order, "healthy" numbers would be at most 0.5, 0.02, 0.5, and 0.05 seconds.

Contributor

@AlfredoG87

I agree that these values are too large. They will also vary by network (mainnet/testnet vs sphere), so if we want to give recommendations for the current public networks we should state that at the top.

My recommendations are as follows:

publisher_receive_latency_ns --> 3 or 2.5 seconds (we expect at least 2 seconds here, since that is the time the CN takes to wrap up a block, plus some more for transmission and block_proof generation)

hashing_block_time --> 250 ms

verification_block_time --> same as above, 250 ms

files_recent_persistence_time_latency_ns --> 100 ms
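
If thresholds along those lines are adopted, a minimal sketch of how they would map onto the raw metrics follows; the numbers simply restate the suggestions above, and the nanosecond/second unit handling is an assumption to verify against the BN metrics docs.

```python
# Sketch: apply the suggested "M" severity latency thresholds.
# Assumption to verify: the *_ns metrics report nanoseconds while the
# *_block_time metrics report seconds; the threshold numbers restate the
# suggestions in this thread and are not finalized values.
NS_PER_SECOND = 1_000_000_000

THRESHOLDS_SECONDS = {
    "publisher_receive_latency_ns": 2.5,
    "hashing_block_time": 0.25,
    "verification_block_time": 0.25,
    "files_recent_persistence_time_latency_ns": 0.1,
}


def exceeds_threshold(metric: str, raw_value: float) -> bool:
    """True if the raw metric value is above its suggested alert threshold."""
    value_s = raw_value / NS_PER_SECOND if metric.endswith("_ns") else raw_value
    return value_s > THRESHOLDS_SECONDS[metric]


# A 3.1 s publisher receive latency (reported in ns) should trigger an alert;
# a 200 ms hashing block time should not.
assert exceeds_threshold("publisher_receive_latency_ns", 3.1 * NS_PER_SECOND)
assert not exceeds_threshold("hashing_block_time", 0.2)
```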

Contributor

@AlfredoG87 AlfredoG87 left a comment

LGTM. Just a couple of nit suggestions.

| Severity | Metric | Alert Condition |
|----------|--------------------------------|-----------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
Contributor

I think 40 is too much. I would recommend 20 or even 10. Are we expecting to have that many publishers connected to a single node in Mainnet or even spheres?

Contributor

In Hedera mainnet we may have all 30-ish nodes connected at once; that is definitely not ideal, but with only 3 LFH nodes (out of 5 total), it could reasonably happen.


Development

Successfully merging this pull request may close these issues.

Verify metrics for BN health and P1 plugins, update docs
