[APM Server/Integration] Report health on major errors #17878

@lucabelluccini

Description

It would be nice to expose major health events in the Kibana UI without requiring users to enable full monitoring and logging collection.

A key example is when tail-based sampling (TBS) reaches its storage limit.

At the moment, those errors are visible only to someone who actively looks at the logs.

Once we have the status reported, it would be good to somehow expose it to the Kibana UI and/or Cloud Admin UI.

Possible implementation via Fleet Protocol

Fleet exposes an API to report the status and health of an integration.

It would be nice to report Degraded or Unhealthy to Fleet based on errors encountered in APM Server/Integration.

It is feasible using https://github.com/elastic/beats/blob/main/libbeat/management/status/status.go

Filebeat uses context.UpdateStatus() for this.

Status changes should only be reported when the status actually changes (don't repeatedly send HEALTHY if the status is already HEALTHY).
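
The "report only on change" rule can be sketched as a small wrapper around whatever component actually sends the status. This is a minimal sketch, not the libbeat implementation: `Status`, `Reporter`, `DedupReporter`, and `recordingReporter` are hypothetical local types defined here for illustration, and the sketch compares the status only, not the message.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a local stand-in for a health state (not the libbeat type).
type Status int

const (
	Healthy Status = iota
	Degraded
	Unhealthy
)

func (s Status) String() string {
	switch s {
	case Healthy:
		return "HEALTHY"
	case Degraded:
		return "DEGRADED"
	default:
		return "UNHEALTHY"
	}
}

// Reporter is a hypothetical sink for status updates, e.g. a Fleet client.
type Reporter interface {
	UpdateStatus(s Status, msg string)
}

// DedupReporter forwards an update only when the status differs from the
// last status it forwarded.
type DedupReporter struct {
	mu   sync.Mutex
	next Reporter
	sent bool   // false until the first update has been forwarded
	last Status // last forwarded status, valid only when sent is true
}

func NewDedupReporter(next Reporter) *DedupReporter {
	return &DedupReporter{next: next}
}

func (d *DedupReporter) UpdateStatus(s Status, msg string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.sent && d.last == s {
		return // same status as before: nothing to report
	}
	d.sent, d.last = true, s
	d.next.UpdateStatus(s, msg)
}

// recordingReporter collects forwarded updates so they can be inspected.
type recordingReporter struct{ calls []string }

func (r *recordingReporter) UpdateStatus(s Status, msg string) {
	r.calls = append(r.calls, fmt.Sprintf("%s: %s", s, msg))
}

func main() {
	rec := &recordingReporter{}
	d := NewDedupReporter(rec)
	d.UpdateStatus(Healthy, "")
	d.UpdateStatus(Healthy, "") // suppressed
	d.UpdateStatus(Degraded, "TBS storage limit reached")
	d.UpdateStatus(Degraded, "TBS storage limit reached") // suppressed
	d.UpdateStatus(Healthy, "")
	for _, c := range rec.calls {
		fmt.Println(c)
	}
}
```

Whether a changed message with an unchanged status should also be forwarded is a design choice this sketch leaves open.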

The state can also contain hints on what to do (e.g. point to a documentation page for remediation).

As a safety measure, we could make health reporting configurable at the APM Integration level (for example, a "report health to Fleet" setting). When this setting is enabled, the APM Integration would report the state to Fleet in addition to logging it.
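
A minimal sketch of that opt-in behavior, combined with the remediation hint mentioned earlier, could look like the following. Everything here is an assumption made for illustration: `Config.ReportHealthToFleet` is not an existing APM Server option, `FleetReporter` is a hypothetical interface rather than the Fleet API, and the documentation URL is a placeholder.

```go
package main

import (
	"fmt"
	"log"
)

// HealthEvent describes a major error plus an optional remediation hint.
type HealthEvent struct {
	State   string // e.g. "DEGRADED" or "UNHEALTHY"
	Message string
	DocsURL string // placeholder remediation link; empty when none applies
}

// Config holds a hypothetical integration-level setting.
type Config struct {
	ReportHealthToFleet bool
}

// FleetReporter is a hypothetical sink standing in for the Fleet status API.
type FleetReporter interface {
	Report(state, msg string)
}

// HealthNotifier always logs, and reports to Fleet only when enabled.
type HealthNotifier struct {
	cfg    Config
	logger *log.Logger
	fleet  FleetReporter
}

func NewHealthNotifier(cfg Config, fleet FleetReporter) *HealthNotifier {
	return &HealthNotifier{cfg: cfg, logger: log.Default(), fleet: fleet}
}

func (n *HealthNotifier) Notify(ev HealthEvent) {
	msg := ev.Message
	if ev.DocsURL != "" {
		msg = fmt.Sprintf("%s (see %s)", ev.Message, ev.DocsURL)
	}
	// Logging happens regardless of the setting.
	n.logger.Printf("%s: %s", ev.State, msg)
	// Reporting to Fleet is opt-in.
	if n.cfg.ReportHealthToFleet && n.fleet != nil {
		n.fleet.Report(ev.State, msg)
	}
}

// memoryReporter keeps reports in memory so the behavior can be inspected.
type memoryReporter struct{ reports []string }

func (m *memoryReporter) Report(state, msg string) {
	m.reports = append(m.reports, state+": "+msg)
}

func main() {
	ev := HealthEvent{
		State:   "DEGRADED",
		Message: "TBS storage limit reached",
		DocsURL: "https://example.com/tbs-storage-limit", // placeholder URL
	}
	for _, enabled := range []bool{false, true} {
		sink := &memoryReporter{}
		NewHealthNotifier(Config{ReportHealthToFleet: enabled}, sink).Notify(ev)
		fmt.Printf("enabled=%v reports=%d\n", enabled, len(sink.reports))
	}
}
```

With the setting disabled the event is only logged, which matches the current behavior, so enabling it is the only change a user would see.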

Advantages:

  • Covers self-hosted and ECH APM Integrations
  • Should be lightweight to implement
  • Removes the need for the customer to enable logging and metrics collection, at least for major errors

Disadvantage:

  • Strongly tied to Fleet, so the approach might not be reusable for the ECH Health API for Cloud Admin
