-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improve synthetic source for tdigest field #138121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve synthetic source for tdigest field #138121
Conversation
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
JonasKunz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a nit and more of a question about min/max estimation, but that can be addressed in a follow-up if needed.
| - match: | ||
| _source: | ||
| latency: | ||
| # Note that we're storing 0.1 as the min, even though it's count is 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to include centroids in the min/max estimations in the estimation when their count is zero?
Good that you raised this here, I didn't notice it before.
I would assume that we should exclude it from the estimations, which would be consistent to the current min/max queryDSL aggregations on the histogram field I think?
We don't store the empty centroids, so min/max in queryDSL will return the smallest/highest populated centroid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that if the user sent in an explicit min that did not correspond to a centroid, we'd store that and it would have the same inconsistency with the QueryDSL implementation.
I also don't want to spend a lot of time on this. The empty bucket handling is an extreme edge case. Our expectation is that users will be sending in valid t-digests, which should never produce empty buckets. To get an empty bucket, the user has to not only not use the t-digest algorithm, but also do something which doesn't really make sense for most histogram types. Frankly, I'm not even sure we should treat these as valid inputs, but that's what the histogram field does and for now I'm trying to keep this as compatible as possible.
I don't think we need to decide this on this PR. Technically, this behavior was added in #137982, I just called it out in this test. We can discuss and change it later if we want to.
|
|
||
| private static final ParseField COUNTS_FIELD = new ParseField(COUNTS_NAME); | ||
| private static final ParseField CENTROIDS_FIELD = new ParseField(CENTROIDS_NAME); | ||
| private static final ParseField TOTAL_COUNT_FIELD = new ParseField(TOTAL_COUNT_FIELD_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also remove TOTAL_COUNT_FIELD_NAME from TDigestFieldMapper, as it is unused now.
Follow up to #137982 to support returning min, max, and sum in the synthetic source results. This also drops support for sending the count as a parameter (and thus doesn't include the total count in the synthetic source result), which matches the behavior of the exponential histogram field.