@@ -233,3 +233,49 @@ Provides metrics related to the backfill process, including On-Demand and Histor
233233| Counter | ` backfill_retries ` | Total number of retries attempted during backfill |
234234| Gauge | ` backfill_status ` | Current status of the backfill process (0=Idle, 1=Running, 2=Error, 3=On-Demand Error, 4=Unknown) |
235235| Gauge | ` backfill_pending_blocks ` | Number of blocks pending to be backfilled |
236+
237+ ## Alerting Recommendations
238+
239+ Alerting rules based on these metrics can notify the operations team of potential issues.
240+ The recommended rules below use Low (L), Medium (M), and High (H) severity levels:
241+
242+ Note: High severity alerts are intentionally left out during the beta 1 phase to reduce noise.
243+ They will be added as the product matures through the beta and rc phases.
244+
245+ **Node Status**: Alerts for overall node health
246+
247+ | Severity | Metric | Alert Condition |
248+ | ----------| --------------------| ---------------------------|
249+ | M | ` app_state_status ` | If not equal to ` RUNNING ` |
250+
251+ **Publisher**: Alerts related to publisher connections and performance
252+
253+ | Severity | Metric | Alert Condition |
254+ | ----------| --------------------------------| -----------------------------------------------------|
255+ | L | `publisher_open_connections` | If value exceeds 40; adjust the threshold as needed |
256+ | M | `publisher_receive_latency_ns` | If value exceeds 5s |
257+
258+ **Failures**: Alerts for various failure metrics
259+
260+ | Severity | Metric | Alert Condition |
261+ | ----------| ----------------------------------------| --------------------------------------------------------------------|
262+ | M | `verification_blocks_error` | If value exceeds 3 in the last 60s |
263+ | M | `publisher_block_send_response_failed` | If value exceeds 3 in the last 60s; adjust the threshold as needed |
264+ | L | `backfill_fetch_errors` | If value exceeds 3 in the last 60s; adjust the threshold as needed |
265+ | M | `publisher_stream_errors` | If value exceeds 3 in the last 60s; adjust the threshold as needed |
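
The "exceeds 3 in the last 60s" conditions above apply to counters, so the alert compares the growth of the counter over a time window rather than its absolute value. The following is a minimal Python sketch of that evaluation, not the project's implementation. The function names, the sample format, and the reset handling are assumptions made for illustration; a monitoring system would normally perform this calculation itself.

```python
def increase_in_window(samples, now, window_s=60.0):
    """Return how much a counter grew during the last `window_s` seconds.

    `samples` is a list of (timestamp_seconds, counter_value) tuples sorted
    by timestamp. This format is hypothetical, chosen for illustration.
    """
    window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(window) < 2:
        # Not enough data points in the window to measure growth.
        return 0.0
    total = 0.0
    for (_, prev), (_, cur) in zip(window, window[1:]):
        # A drop means the counter was reset (for example, a restart),
        # so the new value is counted from zero.
        total += cur - prev if cur >= prev else cur
    return total


def should_alert(samples, now, threshold=3, window_s=60.0):
    """True when the counter grew by more than `threshold` in the window."""
    return increase_in_window(samples, now, window_s) > threshold


if __name__ == "__main__":
    # Hypothetical samples for a counter such as verification_blocks_error.
    samples = [(0, 10), (30, 12), (60, 15), (90, 16)]
    print(should_alert(samples, now=90))
```

Comparing the increase instead of the raw value keeps the alert from firing forever once a long-running node has accumulated more than 3 errors in total.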
266+
267+ **Messaging**: Alerts for messaging service operations on block items and block notifications
268+
269+ | Severity | Metric | Alert Condition |
270+ | ----------| ---------------------------------------------| -----------------------------------------------------------|
271+ | L | `messaging_item_queue_percent_used` | If value exceeds 60%; adjust the threshold as needed |
272+ | L | `messaging_notification_queue_percent_used` | If value exceeds 60%; adjust the threshold as needed |
273+
274+ **Latency**: Alerts for latency metrics in receiving, hashing, verifying, and persisting blocks
275+
276+ | Severity | Metric | Alert Condition |
277+ | ----------| --------------------------------------------| ------------------------------------------------------|
278+ | M | `publisher_receive_latency_ns` | If value exceeds 20s; adjust the threshold as needed |
279+ | M | `hashing_block_time` | If value exceeds 2s; adjust the threshold as needed |
280+ | M | `verification_block_time` | If value exceeds 20s; adjust the threshold as needed |
281+ | M | `files_recent_persistence_time_latency_ns` | If value exceeds 20s; adjust the threshold as needed |
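
The latency thresholds above are given in seconds, while two of the metric names end in `_ns`. Assuming that suffix means the values are reported in nanoseconds (the source does not state the unit explicitly), the threshold has to be converted before comparison. The Python sketch below illustrates this; the helper names and the threshold dictionary are hypothetical.

```python
NANOS_PER_SECOND = 1_000_000_000

# Thresholds in seconds, taken from the Latency table. Only the metrics with
# an `_ns` suffix are listed, because their unit is assumed to be nanoseconds.
LATENCY_THRESHOLDS_S = {
    "publisher_receive_latency_ns": 20,
    "files_recent_persistence_time_latency_ns": 20,
}


def seconds_to_nanos(seconds):
    """Convert a threshold in seconds to nanoseconds."""
    return int(seconds * NANOS_PER_SECOND)


def exceeds_threshold(metric, value_ns):
    """True when a nanosecond reading is strictly above the metric's threshold."""
    return value_ns > seconds_to_nanos(LATENCY_THRESHOLDS_S[metric])


if __name__ == "__main__":
    # A hypothetical 25-second reading expressed in nanoseconds.
    print(exceeds_threshold("publisher_receive_latency_ns", 25 * NANOS_PER_SECOND))
```

Writing the threshold as a bare `20` against a nanosecond metric would make the alert fire on nearly every sample, so keeping the conversion explicit avoids that mistake.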