Conversation

@w1ll-i-code
Contributor

PR for #10576. For now, this will be hidden behind a build-flag until the feature is sufficiently tested and ready for production use. For all the changes made to the original elasticsearch writer, please consult the individual commits.

Copy the elasticsearchwriter to be adapted for the new datastreamwriter.
Set up the build system and hide the feature behind a build flag for now.
Restructure the data sent to Elasticsearch to align with the Elastic
Common Schema specification and separate document indices by check to
reduce the number of distinct fields per index.
Handle the errors returned by Elasticsearch gracefully to let the user
know what went wrong during execution. Do not discard data as soon as a
request to Elasticsearch fails, but keep retrying until the data can be
sent, relying on the WorkQueue for back pressure. Improve the handling
of the flush timer by rescheduling the timer after each flush, making
sure that there are no needless flushes under heavy load.
Re-add support for tags, but make them conform to the ECS specification.
Also add support for labels, which are the former tags in the
elasticsearchwriter.

ref: https://www.elastic.co/docs/reference/ecs/ecs-base
Allow the user to filter which data should be sent to Elasticsearch.
This allows the user to use multiple datastreams to send the data to
different namespaces, for multi-tenancy or different retention policies.
Also accept an ApiKey for authentication with Elasticsearch, in addition
to username + password and certificates.
Icinga2 should manage its own index template if possible. This allows us
to ship potential additions to the document later without requiring
further user input. However, the user has the option to disable the
feature, should they want to manage the template manually. For user
customization, we ship the `icinga2@custom` component template, so that
users can change the behaviour of the template without having to edit
the managed one, making updates easier.
Add a template config with all possible config options for the user as
well as a short description of what each parameter does and how it can
be used. This allows a user to quickly configure the writer without
having to look up lots of documentation online.
Update the documentation to give a clearer overview of the new
elasticsearchdatastreamwriter feature and add the object and its fields
to the syntax highlighting of nano and vim.
Drop messages in the data buffer if the connection to Elasticsearch
fails. This guarantees that the icinga2 process can still shut down,
even with a misconfigured writer or if Elasticsearch is down or
unreachable, without stalling.
Allow the 'datastream_namespace' variable to contain macros and expand
them properly. This allows data to be written into different datastreams
based on object properties, e.g. by zone or custom var.
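
For illustration, a writer routing by a custom variable might look like the following sketch (the host/port attributes and the tenant custom var are assumptions; only datastream_namespace and the resulting naming scheme come from this PR):

object ElasticsearchDatastreamWriter "elasticsearch" {
  host = "127.0.0.1"
  port = 9200

  /* Hypothetical value: expands per checkable, so results land in
   * metrics-icinga2.<check>-<tenant>.
   */
  datastream_namespace = "$host.vars.tenant$"
}
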
@cla-bot cla-bot bot added the cla/signed label Oct 7, 2025
Contributor

@mcodato mcodato left a comment

I'm not sure about putting this feature behind a build-flag that is OFF by default.
IMHO, having it compiled in by default when the PR is merged could make adoption easier and faster. The option to enable or disable the feature already exists.
WDYT @lippserd ?

Contributor

@jschmidt-icinga jschmidt-icinga left a comment

I just gave this a quick look, so this is not a complete review.

Firstly, I agree with @mcodato that the build flag is unnecessary. It can just be on by default, same as all the other perfdata writers.

Secondly, I'd suggest you squash down your commits because they're all operating on the same new component added by this PR. Also there's some whitespace back-and-forth cluttering up the diff.

See below for some additional comments on the code, mostly things the linter complained about:

@w1ll-i-code
Contributor Author

#10577 (review) Hi, thanks for having a look at it, I really appreciate it.

I'd suggest you squash down your commits [...]

I left the commits separate deliberately, so it's easier to review. Each commit can be looked at on its own, without having to reason about all changes at once.

@w1ll-i-code w1ll-i-code force-pushed the wp_elasticsearchdatastreamwriter branch 6 times, most recently from 4a9699b to 9a83f44 Compare October 21, 2025 15:06
@jschmidt-icinga
Contributor

Thank you for this PR. I've briefly spoken to @lippserd about this and I'm going to take another look when I can. I know next to nothing about Elasticsearch yet, so it might take some time until I can test this thoroughly.

@martialblog
Member

martialblog commented Oct 23, 2025

Hi,

Just a hint, I recently addressed another issue with the ElasticsearchWriter format here: #10511

I provided a small but breaking change here: #10518

If the ElasticsearchDatastreamWriter is introduced, both Writers should use the same format, right?

@jschmidt-icinga let me know if you need help with Elasticsearch knowhow, you know where my office is.

@martialblog
Member

FYI, I think this would also (somewhat) address this issue: #9837

With datastreams the backing indices are managed by Elasticsearch.

if (!pdv->GetCrit().IsEmpty() && GetEnableSendThresholds())
	metric->Set("crit", pdv->GetCrit());

pd_fields->Set(pdv->GetLabel(), metric);
Member

@martialblog martialblog Oct 23, 2025

Hi,

Am I reading this correctly?
This would result in a schema where the perfdata label is part of the field name?

Like so:

"_source": {
  "timestamp": "1761225482",
  "perfdata.rta": {
      "value": 0.091,
      "warn": 100,
      "min": 0,
      "crit": 200,
      "unit": "ms"
  },
  "perfdata.pl": {
      "value": 0,
      "warn": 5,
      "min": 0,
      "crit": 15,
      "unit": "%"
  }
}

I think this might cause issues in the longterm, as described here: #6805 and #10511

Since each new field will cause a new mapping in the index. Correct me if I'm wrong.

The issue with adding new fields for every label is this: https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion

In the change to the Elasticwriter I proposed, I used a field called "label" in a list of objects.

Like so:

"_source": {
  "timestamp": "1761225482",
  "perfdata": [
    {
      "value": 0.091,
      "label": "rta"
      "warn": 100,
      "min": 0,
      "crit": 200,
      "unit": "ms"
    },
    {
      "value": 0,
      "label": "pl"
      "warn": 5,
      "min": 0,
      "crit": 15,
      "unit": "%"
    }
  ]
}

This would just create a single field. And there is an expected field name where to find the label.

Otherwise you need to fetch the entire document, scan for all of the fields that start with "perfdata." to find the perfdata fields... and would still need to split the key name at the . to get to the label, right?

This field can then be mapped as either object or nested:
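
A minimal sketch of such a nested mapping (field types are assumed for illustration):

"perfdata": {
  "type": "nested",
  "properties": {
    "label": { "type": "keyword" },
    "value": { "type": "double" },
    "warn":  { "type": "double" },
    "crit":  { "type": "double" },
    "unit":  { "type": "keyword" }
  }
}

Dropping "type": "nested" would give the default object behaviour instead.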

Member

Just to make things a bit clearer: when using an array, using nested here might be better, since it maintains the independence of each object.

For example, when searching all perfdata for an object:

curl -X POST "http://localhost:9200/icinga2/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "host": "myhost" }},
        { "match": { "service": "ping6" }}
      ]
    }
  },
   "fields": [
     "check_result.perfdata.*"
   ],
   "_source": false
}
'

An object mapping returns this:

{
  "_id": "AZoRQYZWWEqPxFmVW73R",
  "fields": {
    "check_result.perfdata.unit": [ "ms", "%" ],
    "check_result.perfdata.label": [ "rta", "pl" ],
    "check_result.perfdata.warn": [ 100, 5 ],
    "check_result.perfdata.min": [ 0, 0 ],
    "check_result.perfdata.value": [ 0, 0 ],
    "check_result.perfdata.crit": [ 200, 15 ]
  }
}

And a nested mapping returns:

{
  "_id": "AZoRUkSmucmeEsXVMUwm",
  "fields": {
    "check_result.perfdata": [
      {
        "warn": [ 100 ],
        "unit": [ "ms" ],
        "min": [ 0 ],
        "crit": [ 200 ],
        "label": [ "rta" ],
        "value": [ 0 ]
      },
      {
        "warn": [ 5 ],
        "unit": [ "%" ],
        "min": [ 0 ],
        "crit": [ 15 ],
        "label": [ "pl" ],
        "value": [ 0 ]
      }
    ]
  }
}

Contributor Author

I think I know the puzzle piece you are missing:
We are indexing the documents based on the check, similar to what elasticsearch already does (e.g. beats might store data in metrics-system.diskio-master).

If you read the config file, you might realize that we are doing something similar with metrics-icinga2.<check>-<namespace>. Since the same check should always produce the same output, we won't have an explosion of fields.

Contributor Author

Also, neither object nor nested work with timeseries datastreams. (If you know any better, feel free to correct me on that.)

Member

Ah ok I see. Thanks for the clarification. So each check would get its own datastream and thus indices. 👍 I hadn't had the time to compile and test the new writer yet, just skimmed over the code.

Are there any things we need to consider when having a datastream per check? There also is a limit on the number of shards per node. https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/miscellaneous-cluster-settings

Might not be an issue, just wanted to mention it. And users can always add more data nodes to scale up.

Member

Finally had some time to test it. Looks good.

curl localhost:9200/_cat/indices
yellow open .ds-metrics-icinga2.swap-default-2025.10.30-000001      CBEhsKj5QweXa12A4PFafQ 1 1 17 0 224.4kb 224.4kb 224.4kb
yellow open .ds-metrics-icinga2.ping4-default-2025.10.30-000001     fe8buP9QROOWv5xt6qxI5A 1 1 16 0 221.3kb 221.3kb 221.3kb
yellow open .ds-metrics-icinga2.ping6-default-2025.10.30-000001     kqhy-33dRl2CtNw2tWe7MQ 1 1 15 0 214.6kb 214.6kb 214.6kb
yellow open .ds-metrics-icinga2.ssh-default-2025.10.30-000001       LKgw--JYSF66eAEZNhZFIQ 1 1 15 0 213.8kb 213.8kb 213.8kb

I still wonder how the number of indices/shards might affect larger setups.

But the search does what it should do:

curl -X POST "localhost:9200/metrics-icinga2.load-default/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "service.name": "3ecab54cb896!load" }}
      ]
    }
  },
   "fields": [
     "perfdata.*"
   ],
   "_source": false
}
'

It's still a bit tricky to fetch the labels for each metric (load1, load5, etc.), since the label is part of a key, but it's an improvement over the current ElasticsearchWriter. Example result:

{
  "_index": ".ds-metrics-icinga2.load-default-2025.10.30-000001",
  "_id": "Cj19dEiEhn-MNA9xAAABmjWiFKQ",
  "_score": 0.03390155,
  "fields": {
    "perfdata.load15.min": [
      0.0
    ],
    "perfdata.load1.min": [
      0.0
    ],
    "perfdata.load5.value": [
      2.22
    ],
    "perfdata.load5.min": [
      0.0
    ],
    "perfdata.load15.value": [
      2.63
    ],
    "perfdata.load1.value": [
      2.16
    ]
  }
}

When I don't query for specific fields, the result is a bit simpler:

...
          "perfdata": {
            "load1": {
              "min": 0,
              "value": 2.3
            },
            "load15": {
              "min": 0,
              "value": 2.2
            },
            "load5": {
              "min": 0,
              "value": 2
            }
          }
...

Overall very nice solution.

"mode": "time_series",
"routing_path": "check.checkable",
"mapping": {
"total_fields.limit": 2000
Member

I don't think we want to do this. The default limit of 1000 is high enough and instead we should avoid writing so many fields.

By that I mean, instead of using the perfdata label in a field name, we should have it in a field "label"

Contributor Author

We should not get much above this limit, but in my tests I found that the icingadb check in particular writes a lot of performance data into its datastream metrics-icinga2.icingadb-default. I don't think any other check will go far beyond that.

}
}
},
"mappings": {
Member

@martialblog martialblog Oct 27, 2025

I was wondering if we could use a mapping closer to the current ElasticsearchWriter?

It should contain everything we need for a timeseries. Instead of a new "checkable" field, we could add "time_series_dimension": true to both the host and service fields, and probably also the checkcommand.

For example:

{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "check_command": {
          "type": "keyword",
          "time_series_dimension": true
        },
        "check_result": {
          "properties": {
            "check_source": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "command": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "execution_end": {
              "type": "date"
            },
            "execution_start": {
              "type": "date"
            },
            "execution_time": {
              "type": "float"
            },
            "exit_status": {
              "type": "long"
            },
            "latency": {
              "type": "float"
            },
            "output": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "perfdata": {
                "type": "nested",
                "properties": {
                    "crit": {
                        "type": "long"
                    },
                    "label": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "max": {
                        "type": "long"
                    },
                    "min": {
                        "type": "long"
                    },
                    "unit": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "value": {
                        "type": "long"
                    },
                    "warn": {
                        "type": "long"
                    }
                }
            },
            "schedule_end": {
              "type": "date"
            },
            "schedule_start": {
              "type": "date"
            },
            "state": {
              "type": "long"
            },
            "vars_after": {
              "properties": {
                "attempt": {
                  "type": "long"
                },
                "reachable": {
                  "type": "boolean"
                },
                "state": {
                  "type": "long"
                },
                "state_type": {
                  "type": "long"
                }
              }
            },
            "vars_before": {
              "properties": {
                "attempt": {
                  "type": "long"
                },
                "reachable": {
                  "type": "boolean"
                },
                "state": {
                  "type": "long"
                },
                "state_type": {
                  "type": "long"
                }
              }
            }
          }
        },
        "current_check_attempt": {
          "type": "long"
        },
        "host": {
          "type": "keyword",
          "time_series_dimension": true
        },
        "last_hard_state": {
          "type": "long"
        },
        "last_state": {
          "type": "long"
        },
        "max_check_attempts": {
          "type": "long"
        },
        "reachable": {
          "type": "boolean"
        },
        "service": {
          "type": "keyword",
          "time_series_dimension": true
        },
        "state": {
          "type": "long"
        },
        "state_type": {
          "type": "long"
        },
        "timestamp": {
          "type": "date"
        },
        "type": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Member

This could be a _component_template that is used for both Writers.

Contributor Author

You are right about the checkable. We can make the host and service names timeseries dimensions. But since we are already writing checks into their own datastreams, I don't think the check_command needs to be a dimension as well.

Contributor

@jschmidt-icinga jschmidt-icinga left a comment

I've also commented a few more nitpicks, but for the moment I want to primarily focus on two areas:

  • Finalize a format/index-template that can also be used by the basic ElasticsearchWriter in the future. (Please see @martialblog's detailed comments in addition to mine.)
  • Minimize the impact this has on ProcessCheckResult execution times by moving most of the work into the work queue.

I'm open to suggestions, especially regarding the data format.

Comment on lines +34 to +37
/* Enable sending the threashold values as additional fields
* with the service check metrics. If set to true, it will
* send warn and crit for every performance data item.
*/
Contributor

Suggested change
/* Enable sending the threashold values as additional fields
* with the service check metrics. If set to true, it will
* send warn and crit for every performance data item.
*/
/* Enable sending the threshold values as additional fields
* with the service check metrics. If set to true, it will
* send warn and crit for every performance data item.
*/

Comment on lines +176 to +178
} catch (const StatusCodeException& es) {
	int status_code = es.GetStatusCode();
	if (status_code == static_cast<int>(http::status::bad_request)) {
Contributor

I still don't like this.
The underlying type of http::status is unsigned, so you should at least use that in StatusCodeException instead of int. However, even better would be to move the exception class to the cpp file and directly store the enum type.

Suggested change
} catch (const StatusCodeException& es) {
	int status_code = es.GetStatusCode();
	if (status_code == static_cast<int>(http::status::bad_request)) {
} catch (const StatusCodeException& es) {
	if (es.GetStatusCode() == http::status::bad_request) {
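
For illustration, a minimal sketch of the cpp-file variant that stores the enum type directly (only the class name and accessors mirror the PR; everything else is assumed):

#include <boost/beast/http/status.hpp>
#include <stdexcept>
#include <string>

namespace http = boost::beast::http;

class StatusCodeException : public std::runtime_error
{
public:
	StatusCodeException(http::status status, std::string message, std::string body)
		: std::runtime_error(std::move(message)), m_Status(status), m_Body(std::move(body))
	{ }

	// Returns the typed status, so callers can compare against http::status
	// enumerators without casts.
	http::status GetStatusCode() const noexcept { return m_Status; }
	const std::string& GetBody() const noexcept { return m_Body; }

private:
	http::status m_Status;
	std::string m_Body;
};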

Comment on lines +100 to +101
StatusCodeException(int status_code, String message, String body)
	: std::runtime_error(std::move(message)), m_StatusCode(status_code), m_Body(std::move(body)) {}
Contributor

I think the status_code should be part of the message passed to std::runtime_error, since DiagnosticInformation() will only print that, and you're not catching the StatusCodeException directly in every instance, so that information would get lost otherwise.

Contributor Author

Thanks, I did not know that. I'll make sure to add more information to the error message.

${ICINGA2_CONFIGDIR}/features-available
)


Contributor

Superfluous newline

namespace http = beast::http;

static void ExceptionHandler(const boost::exception_ptr& exp);
static Array::Ptr ExtractTemplateTags(const MacroProcessor::ResolverList& resolvers, const Array::Ptr& tags_tmpl,
Contributor

tags_tmpl vs. tag_tmpl in the definition.

return Convert::ToBool(m_CompiledFilter->Evaluate(frame));
}

void ElasticsearchDatastreamWriter::CheckResultHandler(const Checkable::Ptr& checkable, const CheckResult::Ptr& cr)
Contributor

Since this function is executed synchronously during ProcessCheckResult, which is a bottleneck as it is, this should only do as much as it really needs to, then enqueue the rest of the work in the work queue, similarly to ElasticsearchWriter since #10420.

Essentially, just the state-information from the checkable needs to be gathered beforehand, so basically set those fields of ecs_host and ecs_service here, then pass it into the work queue, fill out the rest and build the document there.

Contributor Author

Thanks, I actually never ran into the issue where ProcessCheckResult was the bottleneck and wanted to optimize for write speed to elasticsearch instead by having the values pre-computed. I'll restructure it as you suggested.

Contributor

ProcessCheckResult is probably the busiest crossing in the entire program, depending on what features are enabled and how many checks are performed at which intervals. While it may look like it's blocking per-checkable, it's essentially blocking per-endpoint connection. We have been talking about making it process check results in parallel, at least per-checkable, but I don't think anybody is currently working on that.

I don't think write speed to elasticsearch will be impacted by moving these things onto the queue, since you have to go through the queue once anyway (currently just for sending), and at that point it doesn't matter on which side of the work queue the data is put together.
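
A rough sketch of the suggested split (the Build*/Buffer* helpers and the m_WorkQueue member are illustrative assumptions, not the PR's actual code):

// Gather only the state that must be read synchronously, then defer the
// expensive work (building and buffering the document) to the work queue.
void ElasticsearchDatastreamWriter::CheckResultHandler(const Checkable::Ptr& checkable, const CheckResult::Ptr& cr)
{
	auto [host, service] = GetHostService(checkable);

	// Read the volatile checkable state now...
	Dictionary::Ptr ecsHost = BuildEcsHost(host);                              // hypothetical helper
	Dictionary::Ptr ecsService = service ? BuildEcsService(service) : nullptr; // hypothetical helper

	// ...and assemble everything else asynchronously, off the
	// ProcessCheckResult hot path.
	m_WorkQueue.Enqueue([this, ecsHost, ecsService, cr]() {
		Dictionary::Ptr document = BuildDocument(ecsHost, ecsService, cr); // hypothetical helper
		BufferDocument(document);                                          // hypothetical helper
	});
}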

Comment on lines +382 to +408
Dictionary::Ptr ecs_host = new Dictionary({
	{"name", host->GetDisplayName()},
	{"hostname", host->GetName()},
	{"soft_state", host->GetState()},
	{"hard_state", host->GetLastHardState()}
});
if (!host->GetZoneName().IsEmpty()) ecs_host->Set("zone", host->GetZoneName());

char _addr[16];
if (!host->GetAddress().IsEmpty() && inet_pton(AF_INET, host->GetAddress().CStr(), _addr) == 1) {
	ecs_host->Set("ip", host->GetAddress());
} else if (!host->GetAddress6().IsEmpty() && inet_pton(AF_INET6, host->GetAddress6().CStr(), _addr) == 1) {
	ecs_host->Set("ip", host->GetAddress6());
} else if (!host->GetAddress().IsEmpty()) {
	ecs_host->Set("fqdn", host->GetAddress());
}

Dictionary::Ptr ecs_service;
if (service) {
	ecs_service = new Dictionary({
		{"name", service->GetName()},
		{"display_name", service->GetDisplayName()},
		{"soft_state", service->GetState()},
		{"hard_state", service->GetLastHardState()}
	});
	if (!service->GetZoneName().IsEmpty()) ecs_service->Set("zone", service->GetZoneName());
}
Contributor

ecs_service and ecs_host are identical in structure, and in case this check result is for a service, the host's state isn't really relevant anyway. On the other hand, we're missing information about reachability and the current/max attempts. This also seems to make it harder to evaluate on the consumer side, given that they would need to check if a service exists and then work with the data from either the "host" or the "service" subobject.

Unless there is a good reason for doing it this way, I'd suggest sticking with the way it's done in the old ElasticsearchWriter, have a generic checkable object ecs_checkable with a host field and an optional service field.

Contributor Author

I split them like this because the host and service objects have a special meaning in the Elastic Common Schema.

Contributor

Ok, then it makes perfect sense to fill them out this way. I'm still very new to Elasticsearch and will have to look at the ECS. Up to now I was operating under the assumption that the structure of the data sent to Elasticsearch was entirely arbitrary.

Could you please add a link to the relevant documentation for these fields in the code where they are filled out? I think it would be helpful as a reminder and for other people reading the code that are as unfamiliar with es as myself. 😄


String checkable_name = service == nullptr ? host->GetName() : service->GetName();
Dictionary::Ptr check_result = new Dictionary({
	{"checkable", checkable_name},
Contributor

This seems to be redundant, given that we already have this information in the host/service/checkable object.

Contributor Author

Yes, @martialblog has mentioned that as well. We can remove that.

@w1ll-i-code w1ll-i-code force-pushed the wp_elasticsearchdatastreamwriter branch from 9a83f44 to 9b202be Compare October 28, 2025 07:40
@w1ll-i-code w1ll-i-code force-pushed the wp_elasticsearchdatastreamwriter branch 2 times, most recently from 4f050e7 to 9e06023 Compare October 28, 2025 07:41
General improvements to code and documentation. Fixing comments by:
- Mattia Codato
- Johannes Schmidt
@w1ll-i-code w1ll-i-code force-pushed the wp_elasticsearchdatastreamwriter branch from 9e06023 to 5c270a5 Compare October 28, 2025 07:42
@w1ll-i-code
Contributor Author

Thanks @martialblog for the insight. I'd encourage you to read the commit messages, as I tried to lay out my reasoning for some decisions there. Also, take a look at the example config file, as I think it clears up some of your concerns.

Contributor

@jschmidt-icinga jschmidt-icinga left a comment

I took a look at the filter option today and I think that's a very good idea to add something like this to a perfdata writer. We should definitely update the other writers with something similar when we get the chance.

BOOST_THROW_EXCEPTION(ValidationError(this, { "filter" }, "Filter must be an expression."));
}


Contributor

Superfluous newline

Comment on lines +822 to +823
std::unique_ptr<Expression>(MakeLiteral(filter)),
std::unique_ptr<Expression>(MakeLiteral("call"))
Contributor

This should be indented.

}

void ElasticsearchDatastreamWriter::ValidateFilter(const Lazy<Value> &lvalue, const ValidationUtils &utils) {
	Value filter = lvalue();
Contributor

No need to take this by value, make it a const reference.

Comment on lines +825 to +827
FunctionCallExpression *fexpr = new FunctionCallExpression(std::move(indexer), std::move(args));

m_CompiledFilter = std::move(fexpr);
Contributor

Just assign this directly. Move doesn't do anything here, since this is a raw pointer.
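
For example, the direct form would be:

// std::move on a raw pointer is a no-op, so the temporary can be dropped:
m_CompiledFilter = new FunctionCallExpression(std::move(indexer), std::move(args));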

* You can use any attribute of the host, service, checkable or
* checkresult (cr) objects in the filter expression.
*/
// filter = {{ "host.name == 'myhost' || service.name == 'myservice'" }}
Contributor

Why is this quoted inside the lambda? Furthermore, if you remove the outer quotes, the inner single-quotes are not valid DSL syntax.
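
Presumably the intended form, with the outer quotes removed and DSL string literals, would be something like:

// filter = {{ host.name == "myhost" || service.name == "myservice" }}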

Comment on lines +331 to +333
Host::Ptr host;
Service::Ptr service;
tie(host, service) = GetHostService(checkable);
Contributor

You can use structured bindings, like so:

Suggested change
Host::Ptr host;
Service::Ptr service;
tie(host, service) = GetHostService(checkable);
auto [host, service] = GetHostService(checkable);

Comment on lines +336 to +339
frameNS->Set("checkable", checkable);
frameNS->Set("check_result", cr);
frameNS->Set("host", host);
frameNS->Set("service", service);
Contributor

"checkable" and "check_result" are redundant here. The user would need some logic anyway to know if they're dealing with a host or service, and using this would just reverse that logic. Sure, you could do something like match(checkable.name, {{{test*}}}) to get both the host and all its services, but at the cost of introducing a new convention, since we mostly only bind "host", "service" and "obj" in cases like the API-filters.

And the check result can also be retrieved via (host|service).last_check_result.
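
For example, a filter could reach the check result through the bound objects directly (a sketch using the standard last_check_result attribute):

// filter = {{ host.last_check_result.exit_status == 0 }}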

Comment on lines +360 to +362
Host::Ptr host;
Service::Ptr service;
tie(host, service) = GetHostService(checkable);
Contributor

Also use structured bindings here please.

Comment on lines +12 to +17
/* The datastream namespace to use. This can be used to separate different
* Icinga instances. Or for letting multiple Writers write to different
* datastreams in the same Elasticsearch cluster. By using the filter option.
* The Elasticsearch datastream name will be
* "metrics-icinga2.{check}-{datastream_namespace}".
*/
Contributor

The block comments in this file are a bit inconsistent. Can you please ensure they're in this format:

/* Foo
 * bar
 */

Sorry for nitpicking again 😄

> **Note**
>
> This feature requires Elasticsearch to support timeseries datastreams and have the ecs component template installed.
> This feature was tested successfully with Elasticsearch 8.12 and 9.0.8.
Member

Do we need to explicitly mention that the ElasticsearchDatastreamWriter will not work with OpenSearch? Since the current ElasticsearchWriter works with OpenSearch, users might expect the same here.

Member

Did some testing with OpenSearch; luckily there is little that needs to change for this implementation to work with OpenSearch.

  • The index-template.json needs to be changed; users can do that themselves
  • OpenSearch responds with charset=UTF-8, not charset=utf-8

0001-Opensearch.patch

Member

@w1ll-i-code Can we loosen the charset requirement? Modern JSON should use UTF-8 anyway. https://www.rfc-editor.org/rfc/rfc8259

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8

A "tolower()" might also work, but maybe @jschmidt-icinga has some ideas on how the codebase should handle it.

I can take care of the documentation regarding OpenSearch once this PR is merged.
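
One way to loosen it, as a sketch (the function name is made up; where exactly the writer parses the Content-Type header may differ):

#include <boost/algorithm/string/predicate.hpp>
#include <string>

// Compare the Content-Type charset case-insensitively instead of expecting
// the exact string "utf-8", so "UTF-8" from OpenSearch also passes.
static bool IsUtf8Charset(const std::string& charset)
{
	return boost::algorithm::iequals(charset, "utf-8");
}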

Comment on lines +633 to +641
if (response.result_int() > 299) {
	Log(LogCritical, "ElasticsearchDatastreamWriter")
		<< "Unexpected response code " << response.result_int() << " from URL '" << url->Format() << "'. Error: " << response.body();
	BOOST_THROW_EXCEPTION(StatusCodeException(
		response.result_int(),
		"Unexpected response code from Elasticsearch",
		response.body()
	));
}
Contributor

You should not unconditionally log this as an error. For example, if the template already exists, a 400 error is returned without this being an error on our side that the user should be aware of.

So either make this LogDebug here and log errors when appropriate in the catch block outside TrySend (which you already do), or ensure we never expect error status codes unless something actually goes wrong, for example by checking if the template already exists before sending it.
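
For reference, Elasticsearch exposes a HEAD endpoint to probe for an existing index template, so the writer could check before sending (template name assumed):

curl -I "localhost:9200/_index_template/icinga2"
# 200 if the template exists, 404 otherwise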

@martialblog
Member

@w1ll-i-code Maybe we need to handle the fact that CheckCommand objects can have names that are not allowed in index names.

For example:

object CheckCommand "example_FOOBAR_bar-foo" {
  import "ping-common"
  command += [ "-4" ]
  vars.ping_address = "$address$"
}

apply Service "ping1" {
 import "generic-service"
 check_command = "example_FOOBAR_bar-foo"
 assign where host.address
}

object CheckCommand "i like icinga" {
  import "ping-common"
  command += [ "-4" ]
  vars.ping_address = "$address$"
}

apply Service "foobar" {
 import "generic-service"
 check_command = "i like icinga"
 assign where host.address
}

This will result in:

[2025-10-31 13:15:29 +0000] warning/ElasticsearchDatastreamWriter: Error during document creation: illegal_argument_exception: data_stream [metrics-icinga2.example_FOOBAR_bar-foo-default] must be lowercase

[2025-10-31 13:15:51 +0000] warning/ElasticsearchDatastreamWriter: Error during document creation: illegal_argument_exception: data_stream [metrics-icinga2.i like icinga-default] must not contain the following characters ['?','*','<','"',' ','\\','/',',','|','>']

Not sure yet what a good solution might be. Maybe some form of normalization, maybe just letting the users know they need to be careful with the object names, maybe both.
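
One possible normalization, as a sketch (not part of the PR; the forbidden character set mirrors the error message above):

#include <cctype>
#include <string>

// Lowercase the check command name and replace characters Elasticsearch
// forbids in data stream names with '_'.
static std::string NormalizeDatastreamToken(std::string name)
{
	static const std::string forbidden = "\\/*?\"<>| ,";
	for (char& c : name) {
		if (forbidden.find(c) != std::string::npos)
			c = '_';
		else
			c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
	}
	return name;
}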
