`docs/block-node/metrics.md` (46 additions, 0 deletions)
@@ -233,3 +233,49 @@ Provides metrics related to the backfill process, including On-Demand and Historical
| Counter | `backfill_retries` | Total number of retries attempted during backfill |
| Gauge | `backfill_status` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
| Gauge | `backfill_pending_blocks` | Number of blocks pending to be backfilled |

## Alerting Recommendations

Alerting rules based on these metrics can notify the operations team of potential issues.
Using Low (L), Medium (M), and High (H) severity levels, some recommended alerting rules to consider include:

Note: High severity alerts are intentionally left out during the beta 1 phase to reduce noise.
As the product matures through the beta and RC phases, high severity alerts will be added.

**Node Status**: Alerts for overall node health

| Severity | Metric | Alert Condition |
|----------|--------------------|---------------------------|
| M | `app_state_status` | If not equal to `RUNNING` |
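
As one way to act on a rule like this, here is a minimal sketch in Python that polls the metric and flags anything other than `RUNNING`, assuming the node's metrics are scraped by a Prometheus server; the endpoint URL and the numeric value standing in for `RUNNING` are illustrative assumptions, not something this document specifies:

```python
# Sketch: flag the node when `app_state_status` is anything other than RUNNING.
# PROMETHEUS_URL is a hypothetical query endpoint; RUNNING_VALUE is a
# placeholder because the numeric encoding of RUNNING is not specified here.
import requests

PROMETHEUS_URL = "http://localhost:9090"
RUNNING_VALUE = 1.0


def query_instant(expr: str) -> float | None:
    """Run an instant PromQL query and return the first sample's value, if any."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None


status = query_instant("app_state_status")
if status is None or status != RUNNING_VALUE:
    print(f"ALERT (severity M): app_state_status={status}, expected RUNNING")
```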

**Publisher**: Alerts related to publisher connections and performance

| Severity | Metric | Alert Condition |
|----------|--------------------------------|-----------------------------------------------------|
| L | `publisher_open_connections` | If value exceeds 40; otherwise, configure as needed |
| M | `publisher_receive_latency_ns` | If value exceeds 5s |

**Contributor** (on `publisher_open_connections`): I think 40 is too much. I would recommend 20 or even 10. Are we expecting to have that many publishers connected to a single node in Mainnet or even spheres?

**Contributor:** In Hedera mainnet we may have all 30-ish nodes connected at once; that is definitely not ideal, but with only 3 LFH nodes (out of 5 total), it could reasonably happen.

**Contributor** (on `publisher_receive_latency_ns`): This is repeated below with a different value.

**Failures**: Alerts for various failure metrics

| Severity | Metric | Alert Condition |
|----------|----------------------------------------|--------------------------------------------------------------------|
| M | `verification_blocks_error` | If errors during verification exceed 3 in the last 60s |
| M | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s; otherwise, configure as needed |
| L | `backfill_fetch_errors` | If value exceeds 3 in the last 60s; otherwise, configure as needed |
| M | `publisher_stream_errors` | If value exceeds 3 in the last 60s; otherwise, configure as needed |
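
Because these are counters, "exceeds 3 in the last 60s" is most naturally checked as the growth of the counter over a 60-second window rather than its absolute value. A minimal sketch of that check, again assuming a Prometheus server at a hypothetical URL:

```python
# Sketch: alert when a failure counter grows by more than a threshold within
# the last 60 seconds, using PromQL's increase() over a 60s window.
# The Prometheus URL is hypothetical; metric names come from the table above.
import requests

PROMETHEUS_URL = "http://localhost:9090"

FAILURE_THRESHOLDS = {
    "verification_blocks_error": 3,
    "publisher_block_send_response_failed": 3,
    "backfill_fetch_errors": 3,
    "publisher_stream_errors": 3,
}


def increase_over(metric: str, window: str = "60s") -> float:
    """Return how much the counter increased over the window (0.0 if no data)."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f"increase({metric}[{window}])"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


for metric, threshold in FAILURE_THRESHOLDS.items():
    growth = increase_over(metric)
    if growth > threshold:
        print(f"ALERT: {metric} grew by {growth:.0f} in the last 60s (threshold {threshold})")
```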

**Messaging**: Alerts for messaging service operations regarding block items and block notifications

| Severity | Metric | Alert Condition |
|----------|---------------------------------------------|-----------------------------------------------------------|
| L | `messaging_item_queue_percent_used` | If percentage exceeds 60%; otherwise, configure as needed |
| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60%; otherwise, configure as needed |

Comment on lines +271 to +272

**Contributor:** These are a bit too tight; we'd actually prefer to see utilization higher than 60% under load; over 80% might be worth alerting, depending on how large the queues are.

**Contributor:** I think that 60% utilization makes sense here, especially since it is an L severity.

**Contributor:** I'm more concerned that under moderate load we're planning to alert when the node is at its most healthy. That seems backwards. Utilization between 40% and 80% should be "green", with alerts only outside that range (on either end).
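
As a small sketch of the band idea above (the 40%/80% edges are the suggestion from this thread, not documented thresholds):

```python
# Sketch: treat 40%-80% queue utilization as healthy and alert outside that
# band, as suggested above. Input would come from the
# messaging_*_queue_percent_used gauges; the band edges are illustrative.
def queue_utilization_alert(percent_used: float, low: float = 40.0, high: float = 80.0) -> bool:
    """Return True when utilization falls outside the healthy band."""
    return not (low <= percent_used <= high)


assert queue_utilization_alert(60.0) is False  # mid-band: healthy under load
assert queue_utilization_alert(15.0) is True   # possibly starved or idle
assert queue_utilization_alert(92.0) is True   # approaching queue capacity
```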


**Latency**: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks

| Severity | Metric | Alert Condition |
|----------|--------------------------------------------|------------------------------------------------------|
| M | `publisher_receive_latency_ns` | If value exceeds 20s; otherwise, configure as needed |
| M | `hashing_block_time` | If value exceeds 2s; otherwise, configure as needed |
| M | `verification_block_time` | If value exceeds 20s; otherwise, configure as needed |
| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s; otherwise, configure as needed |

Comment on lines +278 to +281

**Contributor:** These are almost certainly far too long. In order, "healthy" numbers would be (<=) 0.5, 0.02, 0.5, and 0.05 (in seconds).

**Contributor:** I agree that these values are too large. These numbers will vary by network (mainnet/testnet vs. sphere); however, if we want to give recommendations for the current public networks, we should state that at the top.

My recommendation is as follows:

- `publisher_receive_latency_ns` --> 3 or 2.5 secs (here we are expecting at least 2 secs, since that is the time the CN takes to wrap up a block, plus some more due to transmission and block_proof generation)
- `hashing_block_time` --> 250 ms
- `verification_block_time` --> same as above, 250 ms
- `files_recent_persistence_time_latency_ns` --> 100 ms
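
One practical note when turning any of these numbers into alert rules: the metrics carrying a `_ns` suffix presumably report nanoseconds, while the thresholds in this discussion are quoted in seconds and milliseconds, so a conversion is needed. A small sketch, using the table's 20s value and some of the reviewer-suggested values as examples (the unit of the metrics without a `_ns` suffix is not stated here):

```python
# Sketch: convert second/millisecond thresholds into nanoseconds for comparison
# against the *_ns latency metrics. Example values are drawn from the table
# (20s) and the review comments above (500ms, 250ms, 100ms).
NS_PER_SECOND = 1_000_000_000


def to_nanoseconds(seconds: float) -> int:
    return round(seconds * NS_PER_SECOND)


print(to_nanoseconds(20))    # 20_000_000_000 ns, threshold as published in the table
print(to_nanoseconds(0.5))   #    500_000_000 ns, one reviewer's publisher receive latency
print(to_nanoseconds(0.25))  #    250_000_000 ns, suggested hashing/verification time
print(to_nanoseconds(0.1))   #    100_000_000 ns, suggested persistence latency
```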
