Commit a67ac67

docs: Add BN metric alert configuration suggestions (#1851)
Provide metric alerting value suggestions to allow operators to get going quickly on observing their BN

Signed-off-by: Nana Essilfie-Conduah <nana@swirldslabs.com>
1 parent 2c8ddf1 commit a67ac67

1 file changed

+46
-0
lines changed

docs/block-node/metrics.md

Lines changed: 46 additions & 0 deletions
@@ -233,3 +233,49 @@ Provides metrics related to the backfill process, including On-Demand and Histor
| Counter | `backfill_retries` | Total number of retries attempted during backfill |
| Gauge | `backfill_status` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
| Gauge | `backfill_pending_blocks` | Number of blocks pending to be backfilled |

## Alerting Recommendations

Alerting rules can be created from these metrics to notify the operations team of potential issues.
The recommended rules below are grouped by area and tagged with a Low (L), Medium (M), or High (H) severity level:

Note: High severity alerts are intentionally left out during the beta 1 phase to reduce noise.
As the product matures through the beta and rc phases, high severity alerts will be added.

**Node Status**: Alerts for overall node health

| Severity | Metric             | Alert Condition           |
|----------|--------------------|---------------------------|
| M        | `app_state_status` | If not equal to `RUNNING` |

**Publisher**: Alerts related to publisher connections and performance

| Severity | Metric                         | Alert Condition                        |
|----------|--------------------------------|----------------------------------------|
| L        | `publisher_open_connections`   | If value exceeds 40 (adjust as needed) |
| M        | `publisher_receive_latency_ns` | If value exceeds 5s                    |

**Failures**: Alerts for various failure metrics

| Severity | Metric                                 | Alert Condition                                       |
|----------|----------------------------------------|-------------------------------------------------------|
| M        | `verification_blocks_error`            | If verification errors exceed 3 in the last 60s       |
| M        | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s (adjust as needed) |
| L        | `backfill_fetch_errors`                | If value exceeds 3 in the last 60s (adjust as needed) |
| M        | `publisher_stream_errors`              | If value exceeds 3 in the last 60s (adjust as needed) |

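The "exceeds 3 in the last 60s" conditions above apply to counters, so the alert has to compare the counter's growth over a trailing window rather than its absolute value. A minimal Python sketch of that check follows. The class name `CounterWindowAlert` and its fields are hypothetical and not part of the Block Node; the sketch also assumes the counter never resets, which a real monitoring system would need to handle.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CounterWindowAlert:
    """Hypothetical helper: fires when a counter grows by more than
    `threshold` within the trailing `window_s` seconds."""

    threshold: float = 3      # "exceeds 3" from the Failures table
    window_s: float = 60.0    # "in the last 60s"
    samples: deque = field(default_factory=deque)  # (timestamp_s, counter_value)

    def observe(self, timestamp_s: float, value: float) -> bool:
        """Record a scraped counter value and return True if the alert fires."""
        self.samples.append((timestamp_s, value))
        # Drop old samples, keeping the newest sample at or before the
        # window start so it can serve as the baseline.
        cutoff = timestamp_s - self.window_s
        while len(self.samples) >= 2 and self.samples[1][0] <= cutoff:
            self.samples.popleft()
        baseline = self.samples[0][1]
        # Assumption: the counter is monotonic (no process restarts).
        return (value - baseline) > self.threshold
```

Feeding it one scraped value per interval, for example `alert.observe(now, publisher_stream_errors_value)`, returns `True` only while more than three new errors fall inside the last minute.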

**Messaging**: Alerts for messaging service operations on block items and block notifications

| Severity | Metric                                      | Alert Condition                             |
|----------|---------------------------------------------|---------------------------------------------|
| L        | `messaging_item_queue_percent_used`         | If percentage exceeds 60% (adjust as needed) |
| L        | `messaging_notification_queue_percent_used` | If percentage exceeds 60% (adjust as needed) |

**Latency**: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks

| Severity | Metric                                     | Alert Condition                         |
|----------|--------------------------------------------|-----------------------------------------|
| M        | `publisher_receive_latency_ns`             | If value exceeds 20s (adjust as needed) |
| M        | `hashing_block_time`                       | If value exceeds 2s (adjust as needed)  |
| M        | `verification_block_time`                  | If value exceeds 20s (adjust as needed) |
| M        | `files_recent_persistence_time_latency_ns` | If value exceeds 20s (adjust as needed) |
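
The latency thresholds above are stated in seconds, while the `_ns` suffix on some metric names suggests values reported in nanoseconds, so a rule needs a unit conversion before comparing. The Python sketch below illustrates this under a loud assumption: it treats all four metrics as nanosecond values, which the table does not confirm for `hashing_block_time` and `verification_block_time`. The function and constant names are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000

# Thresholds in seconds, copied from the Latency table.
LATENCY_THRESHOLDS_S = {
    "publisher_receive_latency_ns": 20,
    "hashing_block_time": 2,          # unit assumed to be ns, not confirmed
    "verification_block_time": 20,    # unit assumed to be ns, not confirmed
    "files_recent_persistence_time_latency_ns": 20,
}


def breached_latency_metrics(samples_ns):
    """Return the sorted names of metrics whose value (in ns) exceeds
    its threshold. Metrics without a configured threshold are ignored."""
    return sorted(
        name
        for name, value_ns in samples_ns.items()
        if name in LATENCY_THRESHOLDS_S
        and value_ns > LATENCY_THRESHOLDS_S[name] * NS_PER_SECOND
    )
```

A value exactly equal to the threshold does not count as a breach, matching the "exceeds" wording in the table.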
