13 changes: 12 additions & 1 deletion docs/severity.md
Original file line number Diff line number Diff line change
@@ -87,6 +87,7 @@
- [prometheus-exporter_kong](#prometheus-exporter_kong)
- [prometheus-exporter_oracledb](#prometheus-exporter_oracledb)
- [prometheus-exporter_postfix](#prometheus-exporter_postfix)
- [prometheus-exporter_redis](#prometheus-exporter_redis)
- [prometheus-exporter_squid](#prometheus-exporter_squid)
- [prometheus-exporter_varnish](#prometheus-exporter_varnish)
- [prometheus-exporter_wallix-bastion](#prometheus-exporter_wallix-bastion)
@@ -139,7 +140,6 @@
|AWS CWAgent heartbeat|X|-|-|-|-|
|AWS CWAgent memory used|X|X|-|-|-|
|AWS CWAgent disk used|X|X|-|-|-|
|AWS CWAgent cpu usage active|X|X|-|-|-|


## fame_azure-automation-updates
@@ -951,6 +951,17 @@
|Postfix size postfix delivery delay|X|X|-|-|-|


## prometheus-exporter_redis

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Redis heartbeat|X|-|-|-|-|
|Redis blocked over connected clients ratio|X|X|-|-|-|
|Redis evicted keys change rate|X|X|-|-|-|
|Redis expired keys change rate|X|X|-|-|-|
|Redis rejected connections|X|X|-|-|-|


## prometheus-exporter_squid

|Detector|Critical|Major|Minor|Warning|Info|
124 changes: 124 additions & 0 deletions modules/prometheus-exporter_redis/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# REDIS SignalFx detectors

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
:link: **Contents**

- [How to use this module?](#how-to-use-this-module)
- [What are the available detectors in this module?](#what-are-the-available-detectors-in-this-module)
- [How to collect required metrics?](#how-to-collect-required-metrics)
- [Metrics](#metrics)
- [Related documentation](#related-documentation)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## How to use this module?

This directory defines a [Terraform](https://www.terraform.io/)
[module](https://www.terraform.io/language/modules/syntax) you can use in your
existing [stack](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#stack) by adding a
`module` configuration and setting its `source` parameter to the URL of this folder:

```hcl
module "signalfx-detectors-prometheus-exporter-redis" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/prometheus-exporter_redis?ref={revision}"

  environment   = var.environment
  notifications = local.notifications
}
```

Note the following parameters:

* `source`: Use this parameter to specify the URL of the module. The double slash (`//`) is intentional and required.
Terraform uses it to specify subfolders within a Git repo (see [module
sources](https://www.terraform.io/language/modules/sources)). The `ref` parameter specifies a specific Git tag in
this repository. It is recommended to use the latest "pinned" version in place of `{revision}`. Avoid using a branch
like `master` except for testing purposes. Note that all modules in this repository are available on the Terraform
[registry](https://registry.terraform.io/modules/claranet/detectors/signalfx) and we recommend using it as the source
instead of `git`, which is more flexible but less future-proof.

* `environment`: Use this parameter to specify the
[environment](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#environment) used by this
instance of the module.
Its value will be added to the `prefixes` list at the start of the [detector
name](https://github.com/claranet/terraform-signalfx-detectors/wiki/Templating#example).
In general, it will also be used in the `filtering` internal sub-module to [apply
filters](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance#filtering) based on our default
[tagging convention](https://github.com/claranet/terraform-signalfx-detectors/wiki/Tagging-convention).

* `notifications`: Use this parameter to define where alerts should be sent depending on their severity. It consists
of a Terraform [object](https://www.terraform.io/language/expressions/type-constraints#object) where each key represents an available
[detector rule severity](https://docs.splunk.com/observability/alerts-detectors-notifications/create-detectors-for-alerts.html#severity)
and its value is a list of recipients. Every recipient must respect the [detector notification
format](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector#notification-format).
Check the [notification binding](https://github.com/claranet/terraform-signalfx-detectors/wiki/Notifications-binding)
documentation to understand the recommended role of each severity.
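
As a minimal sketch of the three parameters above, the module could be sourced from the Terraform registry and wired to a
`notifications` object as shown below. The `{revision}` placeholder, the `oncall` and `team` locals, the credential ID and
the email address are assumptions for illustration only. The severity keys and the recipient string format follow the
documentation linked above.

```hcl
locals {
  # Hypothetical recipients, written in the provider's notification format.
  oncall = "PagerDuty,<credential_id>"
  team   = "Email,team@example.com"

  # One list of recipients per detector rule severity.
  notifications = {
    critical = [local.oncall]
    major    = [local.oncall, local.team]
    minor    = [local.team]
    warning  = [local.team]
    info     = []
  }
}

module "signalfx-detectors-prometheus-exporter-redis" {
  # Registry source: the double slash selects the sub-module folder.
  source  = "claranet/detectors/signalfx//modules/prometheus-exporter_redis"
  version = "{revision}" # placeholder: replace with a released version

  environment   = var.environment
  notifications = local.notifications
}
```

An empty list, as used for `info` here, simply means that no recipient is notified for that severity.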

These 3 parameters along with all variables defined in [common-variables.tf](common-variables.tf) are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
[variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/language/values/variables) make it possible to
customize the detectors' behavior to better fit your needs.
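
For illustration, overriding a few defaults might look like the sketch below. The three variable names are assumptions
inferred from the detector names and are not confirmed by this page: check [variables-gen.tf](variables-gen.tf) for the
actual names before using them. The values are arbitrary examples.

```hcl
module "signalfx-detectors-prometheus-exporter-redis" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/prometheus-exporter_redis?ref={revision}"

  environment   = var.environment
  notifications = local.notifications

  # Assumed variable names: verify them in variables-gen.tf before use.
  evicted_keys_change_rate_threshold_critical = 100
  evicted_keys_change_rate_threshold_major    = 50
  expired_keys_change_rate_disabled           = true
}
```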

Most of them represent usual tips and rules detailed in the
[guidance](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance) documentation and listed in the
common [variables](https://github.com/claranet/terraform-signalfx-detectors/wiki/Variables) dedicated documentation.

Feel free to explore the [wiki](https://github.com/claranet/terraform-signalfx-detectors/wiki) for more information about
general usage of this repository.

## What are the available detectors in this module?

This module creates the following SignalFx detectors, each of which can contain one or more alerting rules:

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Redis heartbeat|X|-|-|-|-|
|Redis blocked over connected clients ratio|X|X|-|-|-|
|Redis evicted keys change rate|X|X|-|-|-|
|Redis expired keys change rate|X|X|-|-|-|
|Redis rejected connections|X|X|-|-|-|

## How to collect required metrics?

This module deploys detectors using metrics reported by the
scraping of a server following the [OpenMetrics convention](https://openmetrics.io/) based on and compatible with [the Prometheus
exposition format](https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#openmetrics-text-format).

These servers are generally called `Prometheus Exporters` and can be scraped by both the [SignalFx Smart Agent](https://github.com/signalfx/signalfx-agent)
thanks to its [prometheus exporter monitor](https://github.com/signalfx/signalfx-agent/blob/main/docs/monitors/prometheus-exporter.md) and the
[OpenTelemetry Collector](https://github.com/signalfx/splunk-otel-collector) using its [prometheus
receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver) or its derivatives.

These exporters are either embedded directly in the tool you want to monitor (e.g. nginx ingress) or installed next to it as
a separate program configured to connect to it, create metrics and expose them as a server.


Check the [Related documentation](#related-documentation) section for more detailed and specific information about this module's dependencies.

The detectors of this module use metrics from the [prometheus redis exporter](https://github.com/oliver006/redis_exporter) plugin for Prometheus.


### Metrics


Here is the list of required metrics for detectors in this module.

* `redis_blocked_clients`
* `redis_connected_clients`
* `redis_evicted_keys_total`
* `redis_expired_keys_total`
* `redis_memory_used_bytes`
* `redis_rejected_connections_total`




## Related documentation

* [Terraform SignalFx provider](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs)
* [Terraform SignalFx detector](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector)
* [Splunk Observability integrations](https://docs.splunk.com/Observability/gdi/get-data-in/integrations.html)
* [Prometheus Exporter for Redis](https://github.com/oliver006/redis_exporter)
1 change: 1 addition & 0 deletions modules/prometheus-exporter_redis/common-filters.tf
1 change: 1 addition & 0 deletions modules/prometheus-exporter_redis/common-locals.tf
1 change: 1 addition & 0 deletions modules/prometheus-exporter_redis/common-modules.tf
1 change: 1 addition & 0 deletions modules/prometheus-exporter_redis/common-variables.tf
1 change: 1 addition & 0 deletions modules/prometheus-exporter_redis/common-versions.tf
12 changes: 12 additions & 0 deletions modules/prometheus-exporter_redis/conf/00-heartbeat.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
module: redis
name: heartbeat
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

transformation: false
exclude_not_running_vm: true

signals:
  signal:
    metric: redis_memory_used_bytes
rules:
  critical:
27 changes: 27 additions & 0 deletions modules/prometheus-exporter_redis/conf/01-blocked-clients.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
module: redis
name: blocked over connected clients ratio
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"


value_unit: "%"

signals:
  A:
    metric: redis_blocked_clients
  B:
    metric: redis_connected_clients
  signal:
    formula: (A/B).scale(100)

rules:
  critical:
    threshold: 5
    comparator: '>'
    lasting_duration: 1h
    lasting_at_least: 0.5
  major:
    threshold: 0
    comparator: '>'
    lasting_duration: 1h
    lasting_at_least: 0.5
    dependency: critical
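
For readers unfamiliar with the generator, the configuration above might translate into a detector roughly like the
following. This is a hand-written sketch, not the generated output: the resource name, the rule descriptions, the
detect labels and the absence of filters and notifications are assumptions. Only the metrics, the aggregation, the
formula, the thresholds and the durations come from the YAML. The `dependency: critical` key is assumed to mean that
the major rule only fires while the critical condition is not met.

```hcl
resource "signalfx_detector" "blocked_over_connected_clients_ratio" {
  name = "Redis blocked over connected clients ratio"

  # SignalFlow program: ratio of blocked to connected clients, in percent.
  program_text = <<-EOF
    A = data('redis_blocked_clients').sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)
    B = data('redis_connected_clients').sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)
    signal = (A/B).scale(100).publish('signal')
    detect(when(signal > 5, lasting='1h', at_least=0.5)).publish('CRIT')
    detect(when(signal > 0, lasting='1h', at_least=0.5) and (not when(signal > 5, lasting='1h', at_least=0.5))).publish('MAJOR')
  EOF

  # Critical: ratio above 5 for at least half of the last hour.
  rule {
    description  = "is too high (above 5)"
    severity     = "Critical"
    detect_label = "CRIT"
  }

  # Major: ratio above 0 for at least half of the last hour, critical not met.
  rule {
    description  = "is too high (above 0)"
    severity     = "Major"
    detect_label = "MAJOR"
  }
}
```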
23 changes: 23 additions & 0 deletions modules/prometheus-exporter_redis/conf/02-evicted-keys.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
module: redis
name: evicted keys change rate
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

signals:
  A:
    metric: redis_evicted_keys_total
    rollup: delta
  signal:
    formula: A.rateofchange()

rules:
  critical:
    threshold: 50
    comparator: '>'
    lasting_duration: 15m
    lasting_at_least: 0.5
  major:
    threshold: 25
    comparator: '>'
    lasting_duration: 15m
    lasting_at_least: 0.5
    dependency: critical
23 changes: 23 additions & 0 deletions modules/prometheus-exporter_redis/conf/04-expired-keys.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
module: redis
name: expired keys change rate
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

signals:
  A:
    metric: redis_expired_keys_total
    rollup: delta
  signal:
    formula: A.rateofchange()

rules:
  critical:
    threshold: 100
    comparator: '>'
    lasting_duration: 15m
    lasting_at_least: 0.5
  major:
    threshold: 50
    comparator: '>'
    lasting_duration: 15m
    lasting_at_least: 0.5
    dependency: critical
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
module: redis
name: rejected connections
tip: maxclients reached
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

signals:
  signal:
    metric: redis_rejected_connections_total
    rollup: delta

transformation: ".sum(over='5m')"

rules:
  critical:
    threshold: 5
    comparator: '>'
  major:
    threshold: 0
    comparator: '>'
    dependency: critical
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
module: redis
name: hitrate
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

value_unit: "%"

signals:
  A:
    metric: redis_keyspace_hits_total
    rollup: delta
  B:
    metric: redis_keyspace_misses_total
    rollup: delta
  signal:
    formula: (A/(A+B)).scale(100)

rules:
  critical:
    threshold: 0
    comparator: '<'
    lasting_duration: 5m
    disabled: true
  major:
    threshold: 10
    comparator: '<'
    lasting_duration: 5m
    dependency: critical
  minor:
    threshold: 30
    comparator: '<'
    lasting_duration: 5m
    dependency: major
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
module: redis
name: high memory fragmentation ratio
runbook_url: "https://www.dynatrace.com/news/blog/introducing-redis-server-monitoring/#key-metrics"
tip: restart redis to recover memory previously unusable due to fragmentation or enable the new active defragmentation feature available since redis 4
aggregation: ".mean(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

signals:
  A:
    metric: redis_memory_used_rss_bytes
    rollup: average
  B:
    metric: redis_memory_used_bytes
    rollup: average
  signal:
    formula: (A/B)

rules:
  critical:
    threshold: 5
    comparator: '>'
    lasting_duration: 15m
    lasting_at_least: 1
  major:
    threshold: 2
    comparator: '>'
    dependency: critical
    lasting_duration: 15m
    lasting_at_least: 1
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
module: redis
name: low memory fragmentation ratio
runbook_url: "https://www.dynatrace.com/news/blog/introducing-redis-server-monitoring/#key-metrics"
tip: increase the memory available on the host or reduce the memory usage from your application
aggregation: ".mean(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

signals:
  A:
    metric: redis_memory_used_rss_bytes
    rollup: average
  B:
    metric: redis_memory_used_bytes
    rollup: average
  signal:
    formula: (A/B)

rules:
  critical:
    threshold: 0.75
    comparator: '<'
    lasting_duration: 15m
    lasting_at_least: 0.5
  major:
    threshold: 1
    comparator: '<'
    dependency: critical
    lasting_duration: 15m
    lasting_at_least: 0.5
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
module: redis
name: stored keys change rate
tip: no change on the keyspace over a long period can indicate it is full. if you use redis as a queue broker or database rather than as a cache, it can be normal to see no activity depending on your application, and you should disable this detector
disable: true
aggregation: ".sum(by=['k8s.workload.name', 'k8s.namespace.name', 'k8s.cluster.name'], allow_missing=True)"

signals:
  A:
    metric: redis_db_keys
  signal:
    formula: A.rateofchange().abs()

rules:
  major:
    threshold: 0
    comparator: '=='
    lasting_duration: 1h
3 changes: 3 additions & 0 deletions modules/prometheus-exporter_redis/conf/module.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
detect:
  metric:
    - redis_memory_used_bytes