Commit 4aa96aa

Add BN metric alert configuration suggestions
Signed-off-by: Nana Essilfie-Conduah <nana@swirldslabs.com>
1 parent 3773b0c commit 4aa96aa

1 file changed, +44 -0 lines changed

docs/block-node/metrics.md

Lines changed: 44 additions & 0 deletions
@@ -233,3 +233,47 @@ Provides metrics related to the backfill process, including On-Demand and Histor
| Counter | `backfill_retries` | Total number of retries attempted during backfill |
| Gauge | `backfill_status` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
| Gauge | `backfill_pending_blocks` | Number of blocks pending to be backfilled |

## Alerting Recommendations
Prometheus alerting rules based on these metrics can notify the operations team of potential issues.
The recommended rules below use Low (L), Medium (M), and High (H) severity levels:

Node Status: High-level alerts for overall node health

| Severity | Metric             | Alert Condition           |
|----------|--------------------|---------------------------|
| M        | `app_state_status` | If not equal to `RUNNING` |

Publisher: Alerts related to publisher connections and performance

| Severity | Metric                         | Alert Condition                         |
|----------|--------------------------------|-----------------------------------------|
| L        | `publisher_open_connections`   | If value exceeds 40 (adjust as needed)  |
| M        | `publisher_receive_latency_ns` | If value exceeds 5 s                    |

Failures: Alerts for various failure metrics

| Severity | Metric                                 | Alert Condition                                        |
|----------|----------------------------------------|--------------------------------------------------------|
| M        | `verification_blocks_error`            | If verification errors exceed 3 in the last 60 s       |
| M        | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60 s (adjust as needed) |
| L        | `backfill_fetch_errors`                | If value exceeds 3 in the last 60 s (adjust as needed) |
| M        | `publisher_stream_errors`              | If value exceeds 3 in the last 60 s (adjust as needed) |

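As a rough illustration, the "exceeds 3 in the last 60 s" conditions above could be written as a Prometheus rule file along the following lines. This is a sketch under assumptions: it treats the metrics as counters exported under exactly the names in the table (exporters often add a namespace prefix or a `_total` suffix, so check the scraped names first), and the group name, alert names, and `severity` label values are hypothetical.

```yaml
# Sketch only: metric names are copied from the table above and may differ
# from the names your Prometheus instance actually scrapes.
groups:
  - name: block-node-failures              # hypothetical group name
    rules:
      - alert: BlockNodeVerificationErrors # hypothetical alert name
        # increase() estimates how much the counter grew over the last 60 s
        expr: increase(verification_blocks_error[1m]) > 3
        labels:
          severity: medium                 # "M" in the table
        annotations:
          summary: "More than 3 block verification errors in the last 60 s"

      - alert: BlockNodeBackfillFetchErrors # hypothetical alert name
        expr: increase(backfill_fetch_errors[1m]) > 3
        labels:
          severity: low                    # "L" in the table
        annotations:
          summary: "More than 3 backfill fetch errors in the last 60 s"
```

Because `increase()` needs at least two samples inside the range, a scrape interval close to 60 s may call for a wider window than `[1m]`.
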
Messaging: Alerts for messaging service operations on block items and block notifications

| Severity | Metric                                      | Alert Condition                           |
|----------|---------------------------------------------|-------------------------------------------|
| L        | `messaging_item_queue_percent_used`         | If usage exceeds 60% (adjust as needed)   |
| L        | `messaging_notification_queue_percent_used` | If usage exceeds 60% (adjust as needed)   |

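The queue thresholds above might be expressed as simple gauge comparisons, as in the sketch below. It assumes both metrics are gauges reporting a value from 0 to 100 (not a 0 to 1 fraction) under the names shown in the table; the `for` duration, group name, alert names, and label values are illustrative assumptions rather than recommended settings.

```yaml
# Sketch only: assumes the gauges report a percentage in the range 0-100.
groups:
  - name: block-node-messaging               # hypothetical group name
    rules:
      - alert: BlockNodeItemQueueUsageHigh   # hypothetical alert name
        expr: messaging_item_queue_percent_used > 60
        for: 2m                              # assumed hold time to ignore short spikes
        labels:
          severity: low                      # "L" in the table
        annotations:
          summary: "Block item queue is more than 60% full"

      - alert: BlockNodeNotificationQueueUsageHigh # hypothetical alert name
        expr: messaging_notification_queue_percent_used > 60
        for: 2m                              # assumed hold time
        labels:
          severity: low
        annotations:
          summary: "Block notification queue is more than 60% full"
```

The `for` clause keeps an alert pending until the condition has held for the whole duration, which tends to suit queue gauges that fluctuate under normal load.
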
Latency: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks

| Severity | Metric                                     | Alert Condition                          |
|----------|--------------------------------------------|------------------------------------------|
| M        | `publisher_receive_latency_ns`             | If value exceeds 20 s (adjust as needed) |
| M        | `hashing_block_time`                       | If value exceeds 2 s (adjust as needed)  |
| M        | `verification_block_time`                  | If value exceeds 20 s (adjust as needed) |
| M        | `files_recent_persistence_time_latency_ns` | If value exceeds 20 s (adjust as needed) |
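
For the metrics whose names end in `_ns`, the thresholds in seconds have to be converted to nanoseconds in the rule expression. The sketch below shows one way to do that, assuming these metrics are gauges exported under the names in the table; the group name, alert names, and label values are hypothetical. `hashing_block_time` and `verification_block_time` are left out because their unit is not stated here.

```yaml
# Sketch only: 20 s = 20 * 1e9 ns, written as 20e9 in PromQL.
groups:
  - name: block-node-latency                     # hypothetical group name
    rules:
      - alert: BlockNodePublisherReceiveLatencyHigh # hypothetical alert name
        expr: publisher_receive_latency_ns > 20e9
        labels:
          severity: medium                       # "M" in the table
        annotations:
          summary: "Publisher receive latency is above 20 s"

      - alert: BlockNodePersistenceLatencyHigh   # hypothetical alert name
        expr: files_recent_persistence_time_latency_ns > 20e9
        labels:
          severity: medium
        annotations:
          summary: "Recent block persistence latency is above 20 s"
```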
