From 4aa96aafb8001263366299d815c7426926473660 Mon Sep 17 00:00:00 2001
From: Nana Essilfie-Conduah
Date: Mon, 10 Nov 2025 22:08:00 -0500
Subject: [PATCH 1/3] Add BN metric alert configuration suggestions

Signed-off-by: Nana Essilfie-Conduah
---
 docs/block-node/metrics.md | 44 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/docs/block-node/metrics.md b/docs/block-node/metrics.md
index 82751c21b..432609581 100644
--- a/docs/block-node/metrics.md
+++ b/docs/block-node/metrics.md
@@ -233,3 +233,47 @@ Provides metrics related to the backfill process, including On-Demand and Histor
 | Counter | `backfill_retries` | Total number of retries attempted during backfill |
 | Gauge | `backfill_status` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
 | Gauge | `backfill_pending_blocks` | Number of blocks pending to be backfilled |
+
+
+## Alerting Recommendations
+Alerting rules can be created in Prometheus based on these metrics to notify the operations team of potential issues.
+Utilizing Low (L), Medium (M) and High (H) severity levels, some recommended alerting rules include:
+
+
+Node Status: High level alerts for overall node health
+
+| Severity | Metric | Alert Condition |
+|----------|--------------------|---------------------------|
+| M | `app_state_status` | If not equal to `RUNNING` |
+
+Publisher: Alerts related to publisher connections and performance
+
+| Severity | Metric | Alert Condition |
+|----------|---------------------------------------------|--------------------------------------------|
+| L | `publisher_open_connections` | If value exceeds 40 or configure as needed |
+| M | `publisher_receive_latency_ns` | If value exceeds 5 s |
+
+Failures: Alerts for various failure metrics
+
+| Severity | Metric | Alert Condition |
+|----------|----------------------------------------|-----------------------------------------------------------|
+| M | `verification_blocks_error` | If errors during verification exceed 3 in last 60 s |
+| M | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s or configure as needed |
+| L | `backfill_fetch_errors` | If value exceeds 3 in the last 60s or configure as needed |
+| M | `publisher_stream_errors` | If value exceeds 3 in the last 60s or configure as needed |
+
+Messaging: Alerts for messaging service operations regarding block items and block notification
+
+| Severity | Metric | Alert Condition |
+|----------|---------------------------------------------|--------------------------------------------------|
+| L | `messaging_item_queue_percent_used` | If percentage exceeds 60% or configure as needed |
+| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60% or configure as needed |
+
+Latency: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks
+
+| Severity | Metric | Alert Condition |
+|----------|--------------------------------------------|---------------------------------------------|
+| M | `publisher_receive_latency_ns` | If value exceeds 20s or configure as needed |
+| M | `hashing_block_time` | If value exceeds 2s or configure as needed |
+| M | `verification_block_time` | If value exceeds 20s or configure as needed |
+| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s or configure as needed |

From 25f6041b5848c99de0dce3978c32df8f07e8a90e Mon Sep 17 00:00:00 2001
From: Nana Essilfie-Conduah
Date: Fri, 14 Nov 2025 14:42:05 -0500
Subject: [PATCH 2/3] apply spotless

Signed-off-by: Nana Essilfie-Conduah
---
 docs/block-node/metrics.md | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/docs/block-node/metrics.md b/docs/block-node/metrics.md
index 432609581..f30f136cf 100644
--- a/docs/block-node/metrics.md
+++ b/docs/block-node/metrics.md
@@ -234,28 +234,27 @@ Provides metrics related to the backfill process, including On-Demand and Histor
 | Gauge | `backfill_status` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
 | Gauge | `backfill_pending_blocks` | Number of blocks pending to be backfilled |
 
-
 ## Alerting Recommendations
+
 Alerting rules can be created in Prometheus based on these metrics to notify the operations team of potential issues.
 Utilizing Low (L), Medium (M) and High (H) severity levels, some recommended alerting rules include:
 
-
 Node Status: High level alerts for overall node health
 
-| Severity | Metric | Alert Condition |
+| Severity | Metric | Alert Condition |
 |----------|--------------------|---------------------------|
 | M | `app_state_status` | If not equal to `RUNNING` |
 
 Publisher: Alerts related to publisher connections and performance
 
-| Severity | Metric | Alert Condition |
-|----------|---------------------------------------------|--------------------------------------------|
-| L | `publisher_open_connections` | If value exceeds 40 or configure as needed |
-| M | `publisher_receive_latency_ns` | If value exceeds 5 s |
+| Severity | Metric | Alert Condition |
+|----------|--------------------------------|--------------------------------------------|
+| L | `publisher_open_connections` | If value exceeds 40 or configure as needed |
+| M | `publisher_receive_latency_ns` | If value exceeds 5 s |
 
 Failures: Alerts for various failure metrics
 
-| Severity | Metric | Alert Condition |
+| Severity | Metric | Alert Condition |
 |----------|----------------------------------------|-----------------------------------------------------------|
 | M | `verification_blocks_error` | If errors during verification exceed 3 in last 60 s |
 | M | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s or configure as needed |
@@ -264,14 +263,14 @@ Failures: Alerts for various failure metrics
 
 Messaging: Alerts for messaging service operations regarding block items and block notification
 
-| Severity | Metric | Alert Condition |
+| Severity | Metric | Alert Condition |
 |----------|---------------------------------------------|--------------------------------------------------|
 | L | `messaging_item_queue_percent_used` | If percentage exceeds 60% or configure as needed |
 | L | `messaging_notification_queue_percent_used` | If percentage exceeds 60% or configure as needed |
 
 Latency: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks
 
-| Severity | Metric | Alert Condition |
+| Severity | Metric | Alert Condition |
 |----------|--------------------------------------------|---------------------------------------------|
 | M | `publisher_receive_latency_ns` | If value exceeds 20s or configure as needed |
 | M | `hashing_block_time` | If value exceeds 2s or configure as needed |

From 8584f1a81ba243cefdfb654ba9bfc216d2ae8355 Mon Sep 17 00:00:00 2001
From: Nana Essilfie-Conduah
Date: Mon, 17 Nov 2025 10:30:50 -0500
Subject: [PATCH 3/3] Added a note

Signed-off-by: Nana Essilfie-Conduah
---
 docs/block-node/metrics.md | 57 ++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/docs/block-node/metrics.md b/docs/block-node/metrics.md
index f30f136cf..5446e2dd8 100644
--- a/docs/block-node/metrics.md
+++ b/docs/block-node/metrics.md
@@ -236,43 +236,46 @@ Provides metrics related to the backfill process, including On-Demand and Histor
 
 ## Alerting Recommendations
 
-Alerting rules can be created in Prometheus based on these metrics to notify the operations team of potential issues.
-Utilizing Low (L), Medium (M) and High (H) severity levels, some recommended alerting rules include:
+Alerting rules can be created based on these metrics to notify the operations team of potential issues.
+Utilizing Low (L), Medium (M) and High (H) severity levels, some recommended alerting rules to consider include:
 
-Node Status: High level alerts for overall node health
+Note: High level alerts are intentionally left out during the beta 1 phase to reduce noise.
+As the product matures through beta and rc phases, high severity alerts will be added.
+
+**Node Status**: High level alerts for overall node health
 
 | Severity | Metric | Alert Condition |
 |----------|--------------------|---------------------------|
 | M | `app_state_status` | If not equal to `RUNNING` |
 
-Publisher: Alerts related to publisher connections and performance
+**Publisher**: Alerts related to publisher connections and performance
 
-| Severity | Metric | Alert Condition |
-|----------|--------------------------------|--------------------------------------------|
-| L | `publisher_open_connections` | If value exceeds 40 or configure as needed |
-| M | `publisher_receive_latency_ns` | If value exceeds 5 s |
+| Severity | Metric | Alert Condition |
+|----------|--------------------------------|-----------------------------------------------------|
+| L | `publisher_open_connections` | If value exceeds 40, otherwise, configure as needed |
+| M | `publisher_receive_latency_ns` | If value exceeds 5s |
 
-Failures: Alerts for various failure metrics
+**Failures**: Alerts for various failure metrics
 
-| Severity | Metric | Alert Condition |
-|----------|----------------------------------------|-----------------------------------------------------------|
-| M | `verification_blocks_error` | If errors during verification exceed 3 in last 60 s |
-| M | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s or configure as needed |
-| L | `backfill_fetch_errors` | If value exceeds 3 in the last 60s or configure as needed |
-| M | `publisher_stream_errors` | If value exceeds 3 in the last 60s or configure as needed |
+| Severity | Metric | Alert Condition |
+|----------|----------------------------------------|--------------------------------------------------------------------|
+| M | `verification_blocks_error` | If errors during verification exceed 3 in last 60s |
+| M | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s, otherwise, configure as needed |
+| L | `backfill_fetch_errors` | If value exceeds 3 in the last 60s, otherwise, configure as needed |
+| M | `publisher_stream_errors` | If value exceeds 3 in the last 60s, otherwise, configure as needed |
 
-Messaging: Alerts for messaging service operations regarding block items and block notification
+**Messaging**: Alerts for messaging service operations regarding block items and block notification
 
-| Severity | Metric | Alert Condition |
-|----------|---------------------------------------------|--------------------------------------------------|
-| L | `messaging_item_queue_percent_used` | If percentage exceeds 60% or configure as needed |
-| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60% or configure as needed |
+| Severity | Metric | Alert Condition |
+|----------|---------------------------------------------|-----------------------------------------------------------|
+| L | `messaging_item_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
+| L | `messaging_notification_queue_percent_used` | If percentage exceeds 60%, otherwise, configure as needed |
 
-Latency: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks
+**Latency**: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks
 
-| Severity | Metric | Alert Condition |
-|----------|--------------------------------------------|---------------------------------------------|
-| M | `publisher_receive_latency_ns` | If value exceeds 20s or configure as needed |
-| M | `hashing_block_time` | If value exceeds 2s or configure as needed |
-| M | `verification_block_time` | If value exceeds 20s or configure as needed |
-| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s or configure as needed |
+| Severity | Metric | Alert Condition |
+|----------|--------------------------------------------|------------------------------------------------------|
+| M | `publisher_receive_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
+| M | `hashing_block_time` | If value exceeds 2s, otherwise, configure as needed |
+| M | `verification_block_time` | If value exceeds 20s, otherwise, configure as needed |
+| M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s, otherwise, configure as needed |
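
Reviewer note, not part of the patch series: the tables added above state alert conditions in prose only. As a rough sketch of how a few of them might look as Prometheus alerting rules (the first revision of the text named Prometheus; the final revision leaves the backend open), the fragment below is one possibility. Everything in it beyond the metric names and thresholds is an assumption: the group name, alert names, and `severity` label values are hypothetical; the metrics are assumed to be exported under exactly the names in the tables (a deployment may add a prefix, or a `_total` suffix on counters); `verification_blocks_error` is assumed to be a counter; and the queue metric is assumed to report 0 to 100 rather than 0 to 1.

```yaml
# Hypothetical rule file sketch; names and labels are illustrative only.
groups:
  - name: block-node-alerts-sketch
    rules:
      # L: more than 40 open publisher connections
      - alert: PublisherOpenConnectionsHigh
        expr: publisher_open_connections > 40
        labels:
          severity: low
      # M: more than 3 verification errors in the last 60s
      # (assumes the metric is a monotonically increasing counter)
      - alert: VerificationBlockErrors
        expr: increase(verification_blocks_error[60s]) > 3
        labels:
          severity: medium
      # L: item queue more than 60% used (assumes a 0-100 scale)
      - alert: MessagingItemQueueFilling
        expr: messaging_item_queue_percent_used > 60
        labels:
          severity: low
      # M: persistence latency above 20s, written in nanoseconds
      # because the metric name ends in _ns
      - alert: PersistenceLatencyHigh
        expr: files_recent_persistence_time_latency_ns > 20000000000
        labels:
          severity: medium
```

The `app_state_status` condition is left out on purpose: the patch says only "not equal to `RUNNING`" and does not give the numeric value the gauge reports for that state, so any expression here would be a guess.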