-
Notifications
You must be signed in to change notification settings - Fork 177
Fix bin nested fields issue #4606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| originalFieldNames.stream() | ||
| .filter(newNames::contains) | ||
| .filter( | ||
| originalName -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's extract for readability / maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted
| // Match exact field names (e.g., "age" == "age") | ||
| // OR nested paths (e.g., "resource.attributes..." starts with | ||
| // "resource") | ||
| newName.equals(originalName) | ||
| || newName.startsWith(originalName + "."))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean it could override all the original fields start with originalName + "."? (like when original fields include resource.a, resource.b, etc. then all of them will be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the question! No - originalFieldNames only contains top-level field names like "resource", not nested paths like "resource.a" or "resource.b".
So when binning resource.c.nested, the filter only matches the top-level "resource" field (via startsWith("resource.")). This causes the entire resource struct to be removed and flattened, with the binned field replacing the original nested value.
Only one top-level struct gets matched and processed, regardless of how many nested fields it contains.
You can refer to this IT test I added: testBinWithNestedFieldWithoutExplicitProjection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Field name could contain . and newName.startsWith("resource.") could match with multiple fields in my understanding.
I am unsure how much we currently support field name containing ., but QualifiedNameResolver implements longest match logic to decide the referred field in case qualified name contains multiple dots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be an unusual case for field name to contain mutliple .?
Looking at the end-to-end flow, originalFieldNames comes from:
context.relBuilder.peek().getRowType().getFieldNames()
This returns only the direct fields of the current RelNode's row type. For a schema with nested structures, the row type would have:
- Top-level field: "resource" (type: STRUCT)
- Top-level field: "severityNumber" (type: INT)
It would not contain "resource.a", "resource.b", etc. as separate top-level fields - those only exist as nested fields within the resource struct definition.
The scenario where multiple top-level fields share a prefix (like resource, resource.a, resource.b all being top-level) would require an unusual schema design(I think we can rarely meet this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenSearch mapping key could include ., and also we can easily introduce field which contains . with eval command.
Why can't we directly match with the actual field we used?
projectPlusOverriding is used for multiple commands, and could have side effect to other commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we directly match with the actual field we used?
We DO directly match. The logic uses both:
- Exact match: newName.equals(originalName)
- Struct match: newName.startsWith(originalName + ".")
This exact logic is already used by eval. Example:
source=telemetry | eval resource.temp = 1 | bin resource.attributes.telemetry.sdk.version span=2
Step 1 - Eval:
- "resource.temp".startsWith("resource.") → true → flattens struct
- Schema becomes: ["resource.attributes.telemetry.sdk.enabled", "resource.attributes.telemetry.sdk.language", ..., "resource.temp"]
Step 2 - Bin:
- Schema already flattened, exact match works: "resource.attributes.telemetry.sdk.version".equals("resource.attributes.telemetry.sdk.version") → true
Both commands use the same logic safely. Added test (testBinWithEvalCreatedDottedFieldName) to confirm no edge case issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @ykmr1224 means if we can directly know the sub field of a struct and project it, instead of relying on the crispy name matching. E.g. if a struct is info {a: string, b: int}, we can directly project info, info.a, info.b instead of relying on pattern info.* to match the names, because there may be another schema info: string, info.a integer, where info.a is not a sub-field of info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuancu Thanks for the clarification! However, I think the current logic handles this situation, you can refer to this newly added integration test using eval command to test: testBinWithEvalCreatedDottedFieldName
For the edge case you mentioned, I think it would be problematic if we have two info.a with different type because that would cause confusion for the commands. For example, when we are using stats count(info.a), which info.a is it referring to? For other cases where there's no naming overlap, I think you can refer to the eval test case
| newName.equals(originalName) | ||
| // OR match nested paths (e.g., "resource.attributes..." starts with | ||
| // "resource.") | ||
| || newName.startsWith(originalName + ".")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any other place using this pattern of detection? Or if there is a better way to detect if it is a nested field or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CalciteRelNodeVisitor.java, the visitFlatten() method uses the same startsWith() pattern:
public RelNode visitFlatten(Flatten node, CalcitePlanContext context) {
visitChildren(node, context);
RelBuilder relBuilder = context.relBuilder;
String fieldName = node.getField().getField().toString();
// Match the sub-field names with "field.*"
List<RelDataTypeField> fieldsToExpand =
relBuilder.peek().getRowType().getFieldList().stream()
.filter(f -> f.getName().startsWith(fieldName + ".")) // ← Same pattern!
.toList();
// ... rest of method
}
(cherry picked from commit 85dc8d9) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* default-main: (34 commits) Enhance dynamic source clause to support only metadata filters (opensearch-project#4554) Make nested alias type support referring to outer context (opensearch-project#4673) Update big5 ppl queries and check plans (opensearch-project#4668) Support push down sort after limit (opensearch-project#4657) Use table scan rowType in filter pushdown could fix rename issue (opensearch-project#4670) Fix: Support Alias Fields in MIN, MAX, FIRST, LAST, and TAKE Aggregations (opensearch-project#4621) Fix bin nested fields issue (opensearch-project#4606) Add `per_minute`, `per_hour`, `per_day` function support (opensearch-project#4531) Pushdown sort aggregate metrics (opensearch-project#4603) Followup: Change ComparableLinkedHashMap to compare Key than Value (opensearch-project#4648) Mitigate the CI failure caused by 500 Internal Server Error (opensearch-project#4646) Allow renaming group-by fields to existing field names (opensearch-project#4586) Publish internal modules separately for downstream reuse (opensearch-project#4484) Revert "Update grammar files and developer guide (opensearch-project#4301)" (opensearch-project#4643) Support Automatic Type Conversion for REX/SPATH/PARSE Command Extractions (opensearch-project#4599) Replace all dots in fields of table scan's PhysType (opensearch-project#4633) Return comparable LinkedHashMap in `valueForCalcite()` of ExprTupleValue (opensearch-project#4629) Refactor JsonExtractAllFunctionIT and MapConcatFunctionIT (opensearch-project#4623) Pushdown case function in aggregations as range queries (opensearch-project#4400) Update GEOIP function to support IP types as input (opensearch-project#4613) ... # Conflicts: # docs/user/ppl/functions/conversion.rst
* default-main: (34 commits) Enhance dynamic source clause to support only metadata filters (opensearch-project#4554) Make nested alias type support referring to outer context (opensearch-project#4673) Update big5 ppl queries and check plans (opensearch-project#4668) Support push down sort after limit (opensearch-project#4657) Use table scan rowType in filter pushdown could fix rename issue (opensearch-project#4670) Fix: Support Alias Fields in MIN, MAX, FIRST, LAST, and TAKE Aggregations (opensearch-project#4621) Fix bin nested fields issue (opensearch-project#4606) Add `per_minute`, `per_hour`, `per_day` function support (opensearch-project#4531) Pushdown sort aggregate metrics (opensearch-project#4603) Followup: Change ComparableLinkedHashMap to compare Key than Value (opensearch-project#4648) Mitigate the CI failure caused by 500 Internal Server Error (opensearch-project#4646) Allow renaming group-by fields to existing field names (opensearch-project#4586) Publish internal modules separately for downstream reuse (opensearch-project#4484) Revert "Update grammar files and developer guide (opensearch-project#4301)" (opensearch-project#4643) Support Automatic Type Conversion for REX/SPATH/PARSE Command Extractions (opensearch-project#4599) Replace all dots in fields of table scan's PhysType (opensearch-project#4633) Return comparable LinkedHashMap in `valueForCalcite()` of ExprTupleValue (opensearch-project#4629) Refactor JsonExtractAllFunctionIT and MapConcatFunctionIT (opensearch-project#4623) Pushdown case function in aggregations as range queries (opensearch-project#4400) Update GEOIP function to support IP types as input (opensearch-project#4613) ... Signed-off-by: Asif Bashar <asif.bashar@gmail.com>
Description
Fixed bin command failing on nested fields (e.g., resource.attributes.telemetry.sdk.version).
Updated
CalciteRelNodeVisitor.projectPlusOverriding()to use prefix matching instead of exact matching for nested field names.Related Issues
bincommand fails to produce output for nested/struct fields #4482