@@ -233,3 +233,47 @@ Provides metrics related to the backfill process, including On-Demand and Histor
233233| Counter | ` backfill_retries ` | Total number of retries attempted during backfill |
234234| Gauge | ` backfill_status ` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
235235| Gauge | ` backfill_pending_blocks ` | Number of blocks pending to be backfilled |
236+
237+
238+ ## Alerting Recommendations
239+ Alerting rules can be created in Prometheus based on these metrics to notify the operations team of potential issues.
240+ The recommended rules below are grouped by area and use Low (L), Medium (M), and High (H) severity levels:
241+
242+
243+ Node Status: High-level alerts for overall node health
244+
245+ | Severity | Metric | Alert Condition |
246+ | ----------| --------------------| ---------------------------|
247+ | M | ` app_state_status ` | If not equal to ` RUNNING ` |
248+
249+ Publisher: Alerts related to publisher connections and performance
250+
251+ | Severity | Metric | Alert Condition |
252+ | ----------| ---------------------------------------------| --------------------------------------------|
253+ | L | ` publisher_open_connections ` | If value exceeds 40 or configure as needed |
254+ | M | ` publisher_receive_latency_ns ` | If value exceeds 5s |
255+
256+ Failures: Alerts for various failure metrics
257+
258+ | Severity | Metric | Alert Condition |
259+ | ----------| ----------------------------------------| -----------------------------------------------------------|
260+ | M | ` verification_blocks_error ` | If verification errors exceed 3 in the last 60s |
261+ | M | ` publisher_block_send_response_failed ` | If value exceeds 3 in the last 60s or configure as needed |
262+ | L | ` backfill_fetch_errors ` | If value exceeds 3 in the last 60s or configure as needed |
263+ | M | ` publisher_stream_errors ` | If value exceeds 3 in the last 60s or configure as needed |
264+
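The "exceeds 3 in the last 60s" conditions in the Failures table map naturally onto the PromQL `increase()` function applied to a range vector. The following is a minimal Python sketch that renders such expressions. It assumes the counters are exported under exactly the names listed in the table (a real deployment may add a prefix), and the helper names `windowed_increase_expr` and `build_failure_rules` are hypothetical, introduced only for illustration.

```python
# Hypothetical helpers that render PromQL alert expressions for the failure
# counters listed in the table above. The metric names are assumed to match
# the exported names exactly; adjust them if your deployment adds a prefix.

# Metric name -> suggested severity, copied from the Failures table.
FAILURE_ALERTS = {
    "verification_blocks_error": "M",
    "publisher_block_send_response_failed": "M",
    "backfill_fetch_errors": "L",
    "publisher_stream_errors": "M",
}


def windowed_increase_expr(metric: str, threshold: int, window: str = "60s") -> str:
    """Return a PromQL expression that holds when the counter grew by more
    than `threshold` within `window`."""
    return f"increase({metric}[{window}]) > {threshold}"


def build_failure_rules(threshold: int = 3, window: str = "60s") -> list:
    """Build one rule description per failure counter."""
    return [
        {
            "metric": metric,
            "severity": severity,
            "expr": windowed_increase_expr(metric, threshold, window),
        }
        for metric, severity in FAILURE_ALERTS.items()
    ]


if __name__ == "__main__":
    for rule in build_failure_rules():
        print(f"[{rule['severity']}] {rule['expr']}")
```

Note that `increase()` extrapolates from the samples inside the window, so a 60s window needs a scrape interval short enough to capture at least two samples; widen the window if your scrape interval is longer.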
265+ Messaging: Alerts for messaging service operations regarding block items and block notification
266+
267+ | Severity | Metric | Alert Condition |
268+ | ----------| ---------------------------------------------| --------------------------------------------------|
269+ | L | ` messaging_item_queue_percent_used ` | If percentage exceeds 60% or configure as needed |
270+ | L | ` messaging_notification_queue_percent_used ` | If percentage exceeds 60% or configure as needed |
271+
272+ Latency: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks
273+
274+ | Severity | Metric | Alert Condition |
275+ | ----------| --------------------------------------------| ---------------------------------------------|
276+ | M | ` publisher_receive_latency_ns ` | If value exceeds 20s or configure as needed |
277+ | M | ` hashing_block_time ` | If value exceeds 2s or configure as needed |
278+ | M | ` verification_block_time ` | If value exceeds 20s or configure as needed |
279+ | M | ` files_recent_persistence_time_latency_ns ` | If value exceeds 20s or configure as needed |
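
The latency thresholds above are stated in seconds, while the `_ns` suffix on some metric names suggests values reported in nanoseconds. The sketch below shows the unit conversion when building a threshold expression. It assumes those metrics are gauges holding nanosecond values, which should be confirmed against the actual exporter; the helper name `latency_expr` is hypothetical.

```python
# Hypothetical helper: converts a threshold given in seconds into nanoseconds
# for metrics whose names end in `_ns`. That these metrics report nanoseconds
# is an assumption based on the suffix.

NANOS_PER_SECOND = 1_000_000_000


def latency_expr(metric: str, threshold_seconds: float) -> str:
    """Return a PromQL comparison of `metric` against a nanosecond threshold."""
    threshold_ns = round(threshold_seconds * NANOS_PER_SECOND)
    return f"{metric} > {threshold_ns}"


if __name__ == "__main__":
    print(latency_expr("publisher_receive_latency_ns", 20))
    print(latency_expr("files_recent_persistence_time_latency_ns", 20))
```

Metrics without the `_ns` suffix, such as `hashing_block_time`, are left out here because their unit is not stated in this section.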