Add BN metric alert configuration suggestions #1851
base: main
Conversation
Signed-off-by: Nana Essilfie-Conduah <nana@swirldslabs.com>
Force-pushed from 7409400 to 25f6041
Signed-off-by: Nana Essilfie-Conduah <nana@swirldslabs.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff             @@
##               main    #1851      +/-   ##
============================================
+ Coverage     80.57%   80.58%    +0.01%
+ Complexity     1178     1177        -1
============================================
  Files           127      127
  Lines          5550     5553        +3
  Branches        591      591
============================================
+ Hits           4472     4475        +3
- Misses          805      806        +1
+ Partials        273      272        -1
```

See 3 files with indirect coverage changes.
mustafauzunn
left a comment
The conventional-pr-title check is not passing.
Besides that, looking good.
jsync-swirlds
left a comment
Just some optional suggestions for alert conditions.
| Severity | Metric | Alert Condition |
|----------|---------------------------------------------|-----------------------------------------------------------|
| L | `messaging_item_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
These are a bit too tight; we'd actually prefer to see utilization higher than 60% under load; over 80% might be worth alerting, depending on how large the queues are.
I think that 60% utilization makes sense here, especially since it is an L severity.
I'm more concerned that under moderate load we're planning to alert when the node is at its most healthy. That seems backwards.
Utilization between 40% and 80% should be "green" and only alert outside that range (on either end).
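For illustration, a band alert like that could be sketched as a Prometheus alerting rule, assuming `messaging_item_queue_percent_used` is scraped as a gauge in percent; the group name, `for` duration, and annotation text below are hypothetical, not part of this PR:

```yaml
groups:
  - name: block-node-queues  # hypothetical group name
    rules:
      - alert: MessagingItemQueueUtilizationOutOfBand
        # Fire only when utilization leaves the 40%-80% "green" band,
        # on either end, sustained long enough to avoid flapping.
        expr: >
          messaging_item_queue_percent_used < 40
          or messaging_item_queue_percent_used > 80
        for: 5m
        labels:
          severity: L
        annotations:
          summary: Messaging item queue utilization outside the 40-80% band
```

The same shape would apply to `messaging_notification_queue_percent_used`; note that the low-end condition would also fire on an idle node, so the lower bound may only make sense under load.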
| Severity | Metric | Alert Condition |
|----------|--------------------------------|------------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
| M | `publisher_receive_latency_ns` | If value exceeds 5s |
This is repeated below with a different value.
| Severity | Metric | Alert Condition |
|----------|--------------------------------------------|-------------------------------------------------------|
| M | `publisher_receive_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
| M | `hashing_block_time` | If value exceeds 2s, otherwise, configure as needed |
| M | `verification_block_time` | If value exceeds 20s, otherwise, configure as needed |
| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
These are almost certainly far too long.
In order, "healthy" numbers would be at most 0.5, 0.02, 0.5, and 0.05 seconds.
I agree that these values are too large.
These numbers will vary by network (mainnet/testnet vs. sphere); if we want to give recommendations for the current public networks, we should state that at the top.
My recommendations are as follows:
- `publisher_receive_latency_ns` --> 3 or 2.5 seconds (here we expect at least 2 seconds, since that is the time the CN takes to wrap up a block, plus some more for transmission and block_proof generation)
- `hashing_block_time` --> 250 ms
- `verification_block_time` --> same as above, 250 ms
- `files_recent_persistence_time_latency_ns` --> 100 ms
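If those recommendations were adopted, a minimal sketch of the corresponding Prometheus rules might look like the following. This assumes the `_ns` metrics are gauges in nanoseconds and the block-time metrics are in seconds; the group and alert names are hypothetical, and the thresholds are only the suggestions above, which are still under discussion:

```yaml
groups:
  - name: block-node-latency  # hypothetical group name
    rules:
      - alert: PublisherReceiveLatencyHigh
        expr: publisher_receive_latency_ns > 2.5e9  # 2.5 s expressed in ns
        for: 5m
        labels:
          severity: M
      - alert: HashingBlockTimeHigh
        expr: hashing_block_time > 0.25  # 250 ms, assuming the metric is in seconds
        for: 5m
        labels:
          severity: M
      - alert: VerificationBlockTimeHigh
        expr: verification_block_time > 0.25  # 250 ms, assuming seconds
        for: 5m
        labels:
          severity: M
      - alert: FilesRecentPersistenceLatencyHigh
        expr: files_recent_persistence_time_latency_ns > 1e8  # 100 ms in ns
        for: 5m
        labels:
          severity: M
```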
AlfredoG87
left a comment
LGTM. Just a couple of nit suggestions.
| Severity | Metric | Alert Condition |
|----------|------------------------------|------------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
I think 40 is too much; I would recommend 20 or even 10. Are we expecting to have that many publishers connected to a single node in Mainnet, or even spheres?
In Hedera mainnet we may have all 30-ish nodes connected at once; that is definitely not ideal, but with only 3 LFH nodes (out of 5 total), it could reasonably happen.
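For reference, the one-sided version of that alert might be sketched as follows, with the threshold (40 vs. 20 or 10) still open per the discussion above; the group name and `for` duration are hypothetical:

```yaml
groups:
  - name: block-node-publishers  # hypothetical group name
    rules:
      - alert: PublisherOpenConnectionsHigh
        # Threshold is still under discussion: 40 in the PR text,
        # 20 (or 10) suggested, while ~30 could be legitimate on Hedera mainnet.
        expr: publisher_open_connections > 40
        for: 10m
        labels:
          severity: L
```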
Reviewer Notes
Provide suggested metric alert values to allow operators to get started quickly with observing their BN.
Related Issue(s)
Fixes #1615