[Meta] Batch Size Metrics Phase 2 - metering over time windows and percentiles #18243

@robbavey

Follow-on from #17838.

Update the metrics exposed by pipelines to provide a statistical view of batch structure.

This helps to understand whether batches are being filled, or whether the total size of the events they contain is too large.
Phase 2 adds percentiles, and values over time windows of 1, 5, and 15 minutes.

Collecting this kind of information requires a histogram metric. The histograms should be contained in or referenced by the pipeline, and writers must record in a thread-safe way; with HdrHistogram, this means using the `Recorder` class. On the read side, which is expected to run in a single thread in the metrics collector, summing the partial histograms into global ones should be thread safe, as tried out in https://gist.github.com/andsel/b56ba80e9bef1aaa95cf435f2366109b.
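To make the writer/reader split concrete, here is a minimal sketch of the pattern using only the JDK. It is not the HdrHistogram `Recorder` and not Logstash code: the class `WindowedBatchHistogram`, its methods, and the coarse power-of-two bucketing are all assumptions made for illustration. Worker threads record into an `AtomicLongArray`; the single collector thread periodically drains it into a timestamped snapshot and merges the snapshots that fall inside a window to estimate a percentile.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical stand-in for a Recorder-backed histogram; names and bucketing are illustrative only.
final class WindowedBatchHistogram {
    private static final int BUCKETS = 64;                      // power-of-two buckets over long values
    private static final long MAX_WINDOW_MILLIS = 15 * 60_000L; // largest window that is kept

    // Written by worker threads, drained by the collector thread.
    private final AtomicLongArray live = new AtomicLongArray(BUCKETS);
    // Touched only by the single collector thread, so a plain deque is enough.
    private final Deque<Snapshot> snapshots = new ArrayDeque<>();

    private static final class Snapshot {
        final long takenAtMillis;
        final long[] counts;
        Snapshot(long takenAtMillis, long[] counts) {
            this.takenAtMillis = takenAtMillis;
            this.counts = counts;
        }
    }

    // Bucket 0 holds values <= 0; bucket i >= 1 holds values in [2^(i-1), 2^i - 1].
    static int bucketOf(long value) {
        return value <= 0 ? 0 : 64 - Long.numberOfLeadingZeros(value);
    }

    // Largest value a bucket can hold, used as the percentile estimate.
    static long upperBound(int bucket) {
        return bucket == 0 ? 0 : (1L << bucket) - 1;
    }

    // Writer side: safe to call from any worker thread.
    void record(long value) {
        live.incrementAndGet(bucketOf(value));
    }

    // Reader side: drain the live counts into a snapshot. Each increment lands in
    // exactly one snapshot because every bucket is swapped to zero atomically.
    void flush(long nowMillis) {
        long[] counts = new long[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            counts[i] = live.getAndSet(i, 0);
        }
        snapshots.addLast(new Snapshot(nowMillis, counts));
        while (nowMillis - snapshots.peekFirst().takenAtMillis >= MAX_WINDOW_MILLIS) {
            snapshots.removeFirst(); // older than the largest window
        }
    }

    // Reader side: merge the snapshots inside the window and walk the buckets up to the rank.
    long percentile(double p, long windowMillis, long nowMillis) {
        long[] merged = new long[BUCKETS];
        long total = 0;
        for (Snapshot s : snapshots) {
            if (nowMillis - s.takenAtMillis < windowMillis) {
                for (int i = 0; i < BUCKETS; i++) {
                    merged[i] += s.counts[i];
                    total += s.counts[i];
                }
            }
        }
        if (total == 0) {
            return 0; // nothing recorded in this window
        }
        long rank = Math.max(1, (long) Math.ceil(p / 100.0 * total));
        long seen = 0;
        for (int i = 0; i < BUCKETS; i++) {
            seen += merged[i];
            if (seen >= rank) {
                return upperBound(i);
            }
        }
        return upperBound(BUCKETS - 1);
    }
}
```

In this sketch the collector would call `flush` on a fixed schedule and then query `percentile(90.0, 60_000L, now)`, `percentile(90.0, 300_000L, now)`, and so on for the 1, 5, and 15 minute views. Unlike a real `Recorder`, the drain here is atomic per bucket rather than across the whole histogram, and the power-of-two buckets give much coarser estimates than HdrHistogram would.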

This feature should be implemented behind a feature flag, so that customers have an escape path in contexts where it proves to be a bottleneck. However, from the analysis done so far, computing the size shouldn't impact performance too much.
The feature flag can be set globally, usually in logstash.yml, or per pipeline in pipelines.yml. The pipeline-level value takes precedence over the global one.
The feature flag is not just a boolean: it can be `none` (disable collection of these metrics), `minimal` (collect from only 1% of batches), or `full` (collect from every batch).
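The precedence rule and the three modes can be sketched as follows. Everything here is hypothetical: the issue does not name the setting or say how the 1% is selected, so `BatchMetricsMode`, `BatchSampler`, and the deterministic "every 100th batch" choice are assumptions for illustration, not the Logstash implementation.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical enum mirroring the three flag values described above.
enum BatchMetricsMode {
    NONE, MINIMAL, FULL;

    // Accepts "none", "minimal", or "full" in any letter case.
    static BatchMetricsMode parse(String raw) {
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // The pipeline-level value, when present, wins over the global one.
    static BatchMetricsMode resolve(Optional<String> pipelineValue, String globalValue) {
        return parse(pipelineValue.orElse(globalValue));
    }
}

// Hypothetical sampler deciding whether a given batch is measured.
final class BatchSampler {
    private final BatchMetricsMode mode;
    private final AtomicLong seen = new AtomicLong();

    BatchSampler(BatchMetricsMode mode) {
        this.mode = mode;
    }

    // Thread-safe: the counter is shared by all workers of the pipeline.
    boolean shouldSample() {
        switch (mode) {
            case FULL:
                return true;
            case MINIMAL:
                // Assumption: 1% is taken as every 100th batch, not as a random draw.
                return seen.getAndIncrement() % 100 == 0;
            default:
                return false;
        }
    }
}
```

A worker would then guard the size computation with `if (sampler.shouldSample()) { ... }`, so that with `none` the hot path pays only for a single branch.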

Phase 2 (provide metering over time windows and percentiles):
