Conversation

Contributor

@Nana-EC Nana-EC commented Nov 11, 2025

Reviewer Notes

Provide suggested metric alerting values so operators can get started quickly on observing their BN.

Related Issue(s)

Fixes #1615

@Nana-EC Nana-EC added this to the 0.23.0 milestone Nov 11, 2025
@Nana-EC Nana-EC self-assigned this Nov 11, 2025
@Nana-EC Nana-EC added Block Node Issues/PR related to the Block Node. Documentation Issues/PR related to documentation Metrics Issues related to Metrics labels Nov 11, 2025
@AlfredoG87 AlfredoG87 modified the milestones: 0.23.0, 0.24.0 Nov 14, 2025
@Nana-EC Nana-EC force-pushed the 1615-metric-alert-recommendations branch from 7409400 to 25f6041 Compare November 14, 2025 19:42
@Nana-EC Nana-EC marked this pull request as ready for review November 17, 2025 15:31
@Nana-EC Nana-EC requested review from a team as code owners November 17, 2025 15:31

codecov bot commented Nov 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main    #1851      +/-   ##
============================================
+ Coverage     80.57%   80.58%   +0.01%     
+ Complexity     1178     1177       -1     
============================================
  Files           127      127              
  Lines          5550     5553       +3     
  Branches        591      591              
============================================
+ Hits           4472     4475       +3     
- Misses          805      806       +1     
+ Partials        273      272       -1     

see 3 files with indirect coverage changes


Contributor

@mustafauzunn mustafauzunn left a comment

The conventional-pr-title check is not passing. Besides that, it's looking good.

Contributor

@jsync-swirlds jsync-swirlds left a comment

Just some optional suggestions for alert conditions.

Comment on lines +271 to +272
| L | `messaging_item_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
Contributor

These are a bit too tight; we'd actually prefer to see utilization higher than 60% under load; over 80% might be worth alerting, depending on how large the queues are.

Contributor

@AlfredoG87

I think that 60% utilization makes sense here, especially since it is an L severity.

Contributor

I'm more concerned that under moderate load we'd be alerting when the node is at its most healthy, which seems backwards.
Utilization between 40% and 80% should be "green"; alert only outside that range (on either end).
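
As a concrete illustration of that band, here is a minimal sketch that checks the two queue metrics from outside the node, assuming they are scraped by a Prometheus server. The `http://localhost:9090` endpoint and the exact band values are placeholder assumptions, not part of this PR.

```python
# Sketch: flag messaging queue utilization outside a 40%-80% "green" band.
# Assumes a Prometheus server at PROM_URL is scraping the Block Node metrics;
# the endpoint, severity label, and band values are illustrative only.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed Prometheus endpoint, adjust as needed
QUEUE_METRICS = [
    "messaging_item_queue_percent_used",
    "messaging_notification_queue_percent_used",
]
LOW, HIGH = 40.0, 80.0  # the "green" band discussed above


def query_instant(metric: str) -> list:
    """Run a Prometheus instant query and return (labels, value) pairs."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    return [(r["metric"], float(r["value"][1])) for r in body["data"]["result"]]


def check_band() -> None:
    """Print an alert line for any series outside the LOW-HIGH band."""
    for metric in QUEUE_METRICS:
        for labels, value in query_instant(metric):
            if value < LOW or value > HIGH:
                print(f"ALERT (L): {metric} {labels} = {value:.1f}% outside {LOW:.0f}-{HIGH:.0f}%")
            else:
                print(f"OK: {metric} {labels} = {value:.1f}%")


if __name__ == "__main__":
    check_band()
```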

| Severity | Metric | Alert Condition |
|----------|--------------------------------|-----------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
| M | `publisher_receive_latency_ns` | If value exceeds 5s |
Contributor

This is repeated below with a different value.

Comment on lines +278 to +281
| M | `publisher_receive_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
| M | `hashing_block_time` | If value exceeds 2s, otherwise, configure as needed |
| M | `verification_block_time` | If value exceeds 20s, otherwise, configure as needed |
| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
Contributor

These are almost certainly far too long.
In order, "healthy" numbers would be at most 0.5, 0.02, 0.5, and 0.05 seconds.

Contributor

@AlfredoG87

I agree that these values are too large. They will also vary by network (mainnet/testnet vs sphere), so if we want to give recommendations for the current public networks we should state that at the top.

My recommendations are as follows:

publisher_receive_latency_ns --> 3 or 2.5 seconds (we expect at least 2 seconds here, since that is the time the CN takes to wrap up a block, plus some more for transmission and block_proof generation)

hashing_block_time --> 250 ms

verification_block_time --> same as above, 250 ms

files_recent_persistence_time_latency_ns --> 100 ms
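
If thresholds along those lines are adopted, a minimal sketch of how they would map onto the raw metrics follows; the numbers simply restate the suggestions above, and the nanosecond/second unit handling is an assumption to verify against the BN metrics docs.

```python
# Sketch: apply the suggested "M" severity latency thresholds.
# Assumption to verify: the *_ns metrics report nanoseconds while the
# *_block_time metrics report seconds; the threshold numbers restate the
# suggestions in this thread and are not finalized values.
NS_PER_SECOND = 1_000_000_000

THRESHOLDS_SECONDS = {
    "publisher_receive_latency_ns": 2.5,
    "hashing_block_time": 0.25,
    "verification_block_time": 0.25,
    "files_recent_persistence_time_latency_ns": 0.1,
}


def exceeds_threshold(metric: str, raw_value: float) -> bool:
    """True if the raw metric value is above its suggested alert threshold."""
    value_s = raw_value / NS_PER_SECOND if metric.endswith("_ns") else raw_value
    return value_s > THRESHOLDS_SECONDS[metric]


# A 3.1 s publisher receive latency (reported in ns) should trigger an alert;
# a 200 ms hashing block time should not.
assert exceeds_threshold("publisher_receive_latency_ns", 3.1 * NS_PER_SECOND)
assert not exceeds_threshold("hashing_block_time", 0.2)
```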

Contributor

@AlfredoG87 AlfredoG87 left a comment

LGTM. Just a couple of nit suggestions.

| Severity | Metric | Alert Condition |
|----------|--------------------------------|-----------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
Contributor

I think 40 is too much. I would recommend 20 or even 10. Are we expecting to have that many publishers connected to a single node in Mainnet or even spheres?

Contributor

In Hedera mainnet we may have all 30-ish nodes connected at once; that is definitely not ideal, but with only 3 LFH nodes (out of 5 total), it could reasonably happen.


Development

Successfully merging this pull request may close these issues.

Verify metrics for BN health and P1 plugins, update docs
