# docs: Add BN metric alert configuration suggestions #1851
@@ -233,3 +233,49 @@ Provides metrics related to the backfill process, including On-Demand and Histor
| Counter | `backfill_retries` | Total number of retries attempted during backfill |
| Gauge | `backfill_status` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
| Gauge | `backfill_pending_blocks` | Number of blocks pending to be backfilled |

## Alerting Recommendations

Alerting rules can be created based on these metrics to notify the operations team of potential issues.
Using Low (L), Medium (M), and High (H) severity levels, some recommended alerting rules to consider include:

Note: High severity alerts are intentionally left out during the beta 1 phase to reduce noise.
As the product matures through the beta and rc phases, high severity alerts will be added.

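How these conditions are evaluated depends on the monitoring stack, which this section does not prescribe. As one illustration, if the node's metrics are scraped by a Prometheus server, each condition below can be written as a PromQL expression and checked against the standard `/api/v1/query` endpoint. A minimal Python sketch, with the Prometheus address as a hypothetical placeholder:

```python
# Minimal sketch: evaluate one alert condition as a PromQL expression against a
# Prometheus HTTP API. Assumes the Block Node's metrics are scraped by a Prometheus
# server at PROM_URL (hypothetical address); adjust both to your deployment.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: a local Prometheus instance


def condition_firing(expr: str) -> bool:
    """Return True if the PromQL expression returns at least one sample.

    PromQL comparison operators (e.g. `metric > 40`) filter out series that do
    not match, so a non-empty result vector means the condition currently holds.
    """
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return len(payload["data"]["result"]) > 0


if __name__ == "__main__":
    # Example using the Low severity threshold from the Publisher table below.
    if condition_firing("publisher_open_connections > 40"):
        print("ALERT [L]: publisher_open_connections above 40")
```
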
**Node Status**: High-level alerts for overall node health

| Severity | Metric             | Alert Condition           |
|----------|--------------------|---------------------------|
| M        | `app_state_status` | If not equal to `RUNNING` |

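`app_state_status` is a gauge, so "not equal to `RUNNING`" has to be compared against whichever value the node reports for the RUNNING state. A minimal sketch that scrapes the metrics endpoint directly; both the endpoint URL and the numeric code for RUNNING are assumptions to verify against the actual deployment:

```python
# Minimal sketch: read app_state_status from the node's Prometheus-format metrics
# endpoint and flag a Medium severity alert when it is not RUNNING.
# METRICS_URL and RUNNING_VALUE are hypothetical placeholders, not values from
# this PR; check the node's documentation for the real endpoint and state codes.
import urllib.request

METRICS_URL = "http://localhost:9999/metrics"  # hypothetical metrics endpoint
RUNNING_VALUE = 1.0                            # hypothetical numeric code for RUNNING


def read_metric(name: str) -> float | None:
    """Return the first sample of `name` from the text exposition format, if any."""
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip HELP/TYPE comments and blank lines
            parts = line.split()
            metric_name = parts[0].split("{", 1)[0]  # drop any label set
            if metric_name == name:
                return float(parts[1])  # assumes no spaces inside label values
    return None


if __name__ == "__main__":
    status = read_metric("app_state_status")
    if status is None or status != RUNNING_VALUE:
        print(f"ALERT [M]: app_state_status={status} (expected RUNNING)")
```
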
**Publisher**: Alerts related to publisher connections and performance

| Severity | Metric                         | Alert Condition                                     |
|----------|--------------------------------|-----------------------------------------------------|
| L        | `publisher_open_connections`   | If value exceeds 40; otherwise, configure as needed |
| M        | `publisher_receive_latency_ns` | If value exceeds 5s                                 |

Contributor: This is repeated below with a different value.

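The `_ns` suffix suggests `publisher_receive_latency_ns` is reported in nanoseconds, so the 5s threshold here (and the 20s value the Latency table repeats below) needs converting to raw units. A small sketch that keeps that conversion explicit; the expressions are illustrative strings, not rules shipped with the node:

```python
# Sketch: Publisher alert conditions as expression strings, with the
# seconds-to-nanoseconds conversion kept explicit so the threshold is easy to
# retune (5s here, 20s in the Latency table below). Illustrative only.
def seconds_to_ns(seconds: float) -> int:
    """Convert a human-readable threshold in seconds to nanoseconds."""
    return int(seconds * 1_000_000_000)


PUBLISHER_ALERTS = [
    # (severity, expression, description)
    ("L", "publisher_open_connections > 40",
     "more than 40 open publisher connections"),
    ("M", f"publisher_receive_latency_ns > {seconds_to_ns(5)}",
     "publisher receive latency above 5s (metric is in nanoseconds)"),
]

if __name__ == "__main__":
    for severity, expression, description in PUBLISHER_ALERTS:
        print(f"[{severity}] {expression}  # {description}")
```
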
**Failures**: Alerts for various failure metrics

| Severity | Metric                                 | Alert Condition                                                    |
|----------|----------------------------------------|--------------------------------------------------------------------|
| M        | `verification_blocks_error`            | If errors during verification exceed 3 in the last 60s             |
| M        | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s; otherwise, configure as needed |
| L        | `backfill_fetch_errors`                | If value exceeds 3 in the last 60s; otherwise, configure as needed |
| M        | `publisher_stream_errors`              | If value exceeds 3 in the last 60s; otherwise, configure as needed |

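These conditions are windowed ("in the last 60s"), which for counter-style failure metrics means the increase over the window rather than the absolute value; in Prometheus terms that is `increase(metric[60s]) > 3`. Without a query engine, the same idea can be approximated by sampling the counters twice, 60 seconds apart, as in this sketch (the metrics endpoint is a hypothetical placeholder, and counter resets are not handled):

```python
# Sketch: approximate "value exceeds 3 in the last 60s" for counter-style failure
# metrics by sampling the endpoint twice, 60 seconds apart, and comparing the delta
# to the threshold. With Prometheus, `increase(verification_blocks_error[60s]) > 3`
# expresses the same condition directly. METRICS_URL is a hypothetical placeholder;
# counter resets between the two samples are not handled.
import time
import urllib.request

METRICS_URL = "http://localhost:9999/metrics"  # hypothetical metrics endpoint

FAILURE_THRESHOLDS = {
    # metric name: (severity, max allowed increase over the 60s window)
    "verification_blocks_error": ("M", 3),
    "publisher_block_send_response_failed": ("M", 3),
    "backfill_fetch_errors": ("L", 3),
    "publisher_stream_errors": ("M", 3),
}


def scrape_samples() -> dict[str, float]:
    """Return name -> value for all un-labeled samples on the metrics endpoint."""
    values: dict[str, float] = {}
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            parts = line.strip().split()
            if len(parts) == 2 and not parts[0].startswith("#") and "{" not in parts[0]:
                try:
                    values[parts[0]] = float(parts[1])
                except ValueError:
                    pass  # not a plain numeric sample
    return values


if __name__ == "__main__":
    before = scrape_samples()
    time.sleep(60)  # evaluation window from the table above
    after = scrape_samples()
    for name, (severity, threshold) in FAILURE_THRESHOLDS.items():
        delta = after.get(name, 0.0) - before.get(name, 0.0)
        if delta > threshold:
            print(f"ALERT [{severity}]: {name} increased by {delta:.0f} in the last 60s")
```
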
**Messaging**: Alerts for messaging service operations regarding block items and block notifications

| Severity | Metric                                      | Alert Condition                                           |
|----------|---------------------------------------------|-----------------------------------------------------------|
| L        | `messaging_item_queue_percent_used`         | If percentage exceeds 60%; otherwise, configure as needed |
| L        | `messaging_notification_queue_percent_used` | If percentage exceeds 60%; otherwise, configure as needed |

Comment on lines +271 to +272

Contributor: These are a bit too tight; we'd actually prefer to see utilization higher than 60% under load; over 80% might be worth alerting, depending on how large the queues are.

Contributor: I think that 60% utilization makes sense here, especially since is a

Contributor: I'm more concerned that under moderate load we're planning to alert when the node is at its most healthy. That seems backwards.

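Whichever percentage is chosen, one way to address the concern above is to require the condition to hold across several consecutive samples before alerting, the same idea as a `for:` duration in Prometheus alerting rules, so a brief burst of healthy high utilization does not page anyone. A sketch with the sampling function left as a stand-in:

```python
# Sketch: reduce noise on the queue-utilization alerts by firing only when the
# condition holds for several consecutive samples (analogous to a `for:` duration
# in Prometheus alerting rules) rather than on a single reading. The sampling
# function is a stand-in; wire it to the real messaging_*_percent_used metrics.
import random  # stand-in data source for this example only
import time
from collections import deque

THRESHOLD_PERCENT = 60       # value from the table; 80 was suggested in review
CONSECUTIVE_SAMPLES = 3      # readings in a row that must breach the threshold
SAMPLE_INTERVAL_SECONDS = 10


def sample_queue_percent_used() -> float:
    """Stand-in for reading messaging_item_queue_percent_used from the node."""
    return random.uniform(0, 100)


def monitor() -> None:
    recent: deque[float] = deque(maxlen=CONSECUTIVE_SAMPLES)
    while True:
        recent.append(sample_queue_percent_used())
        if len(recent) == CONSECUTIVE_SAMPLES and all(v > THRESHOLD_PERCENT for v in recent):
            print(f"ALERT [L]: queue above {THRESHOLD_PERCENT}% for "
                  f"{CONSECUTIVE_SAMPLES} consecutive samples: {[round(v) for v in recent]}")
        time.sleep(SAMPLE_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor()
```
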
**Latency**: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks

| Severity | Metric                                     | Alert Condition                                      |
|----------|--------------------------------------------|------------------------------------------------------|
| M        | `publisher_receive_latency_ns`             | If value exceeds 20s; otherwise, configure as needed |
| M        | `hashing_block_time`                       | If value exceeds 2s; otherwise, configure as needed  |
| M        | `verification_block_time`                  | If value exceeds 20s; otherwise, configure as needed |
| M        | `files_recent_persistence_time_latency_ns` | If value exceeds 20s; otherwise, configure as needed |

Comment on lines +278 to +281

Contributor: These are almost certainly far too long.

Contributor: I agree that these are too large values: and then my recommendation is as follows:

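Given the feedback above that these values will likely be tightened, it can help to keep the thresholds in one place, converted from human-readable seconds into each metric's raw unit. The `_ns` suffix suggests nanoseconds for two of the metrics; the units of `hashing_block_time` and `verification_block_time` are not stated in this diff, so the conversion used for them below is an explicit assumption:

```python
# Sketch: the latency thresholds from the table above, kept as one tunable mapping
# in raw metric units so they are easy to tighten per the review feedback.
# Unit assumptions: *_ns metrics are taken to be nanoseconds; hashing_block_time and
# verification_block_time are assumed to be milliseconds and must be verified
# against the node's metric documentation before use.
NS_PER_SECOND = 1_000_000_000
MS_PER_SECOND = 1_000

LATENCY_THRESHOLDS = {
    # metric name: (severity, threshold in raw metric units)
    "publisher_receive_latency_ns": ("M", 20 * NS_PER_SECOND),
    "hashing_block_time": ("M", 2 * MS_PER_SECOND),        # assumed milliseconds
    "verification_block_time": ("M", 20 * MS_PER_SECOND),  # assumed milliseconds
    "files_recent_persistence_time_latency_ns": ("M", 20 * NS_PER_SECOND),
}


def breached(metric: str, observed_value: float) -> bool:
    """Return True if an observed sample exceeds the configured threshold."""
    _severity, threshold = LATENCY_THRESHOLDS[metric]
    return observed_value > threshold


if __name__ == "__main__":
    # Example with a made-up observation of 25s of publisher receive latency.
    metric = "publisher_receive_latency_ns"
    if breached(metric, 25 * NS_PER_SECOND):
        severity = LATENCY_THRESHOLDS[metric][0]
        print(f"ALERT [{severity}]: {metric} above configured threshold")
```
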
Comment on the `publisher_open_connections` threshold: I think 40 is too much. I would recommend 20 or even 10. Are we expecting to have that many publishers connected to a single node in Mainnet or even spheres?

Comment: In Hedera mainnet we may have all 30-ish nodes connected at once; that is definitely not ideal, but with only 3 LFH nodes (out of 5 total), it could reasonably happen.