Description
Query Information
PPL Command/Query:
source=opensearch_dashboards_sample_data_logs
| rex field=message '.+"(?<RestMethod>(GET|POST)) .+?(?<HttpVersion>HTTP/[0-9\.]+)" (?<HttpStatus>\d+) (?<BytesSent>.+?) .+'
| fields RestMethod, HttpVersion, HttpStatus, BytesSent, message
| head 1
Expected Result:
{
"datarows": [
[
"GET",
"HTTP/1.1",
"200",
"6219",
"223.87.60.27 - - [2018-07-22T00:39:02.912Z] \"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\""
]
]
}
Actual Result:
{
"datarows": [
[
"GET",
"GET",
"HTTP/1.1",
"200",
"223.87.60.27 - - [2018-07-22T00:39:02.912Z] \"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\""
]
]
}
The values are shifted by one position:
- RestMethod column shows "GET" (correct value, but appears twice)
- HttpVersion column shows "GET" (should be "HTTP/1.1")
- HttpStatus column shows "HTTP/1.1" (should be "200")
- BytesSent column shows "200" (should be "6219")
Dataset Information
Dataset/Schema Type
- OpenTelemetry (OTEL)
- Simple Schema for Observability (SS4O)
- Open Cybersecurity Schema Framework (OCSF)
- Custom (OpenSearch Dashboards sample data)
Index Mapping
{
"mappings": {
"properties": {
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"timestamp": {
"type": "date"
},
"response": {
"type": "text"
},
"request": {
"type": "text"
}
}
}
}
Sample Data
{
"message": "223.87.60.27 - - [2018-07-22T00:39:02.912Z] \"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\"",
"timestamp": "2025-09-28T00:39:02.912Z",
"response": 200,
"request": "/opensearch/opensearch-1.0.0.deb",
"bytes": 6219,
"clientip": "223.87.60.27"
}
Bug Description
Issue Summary:
The rex command returns misaligned values when a pattern contains unnamed capture groups nested inside named capture groups. Each unnamed group shifts the extracted values of every named group that follows it by one position.
Steps to Reproduce:
- Use the sample data index opensearch_dashboards_sample_data_logs.
- Execute a rex command with named capture groups containing nested unnamed groups:
source=opensearch_dashboards_sample_data_logs | rex field=message '.+"(?<RestMethod>(GET|POST)) .+?(?<HttpVersion>HTTP/[0-9\.]+)" (?<HttpStatus>\d+) (?<BytesSent>.+?) .+' | fields RestMethod, HttpVersion, HttpStatus, BytesSent | head 1
- Observe that the values are shifted: the RestMethod value appears in both the RestMethod and HttpVersion columns, the HttpVersion value appears in the HttpStatus column, and so on.
Comparison Test:
- Without nested groups (works correctly):
source=opensearch_dashboards_sample_data_logs | rex field=message '.+"(?<Method>\w+) (?<Path>.+?) (?<Version>HTTP/[0-9\.]+)' | fields Method, Path, Version | head 1
Result: ✅ Correct - "GET", "/opensearch/opensearch-1.0.0.deb_1", "HTTP/1.1"
- With nested groups (fails):
source=opensearch_dashboards_sample_data_logs | rex field=message '.+"(?<Method>(GET|POST)) (?<Path>.+?) (?<Version>HTTP/[0-9\.]+)' | fields Method, Path, Version | head 1
Result: ❌ Incorrect - "GET", "GET", "/opensearch/opensearch-1.0.0.deb_1"
Impact:
This bug breaks parenthesized alternation such as (GET|POST) or (error|warning) inside named capture groups, which is a common pattern in log parsing. Users must work around this by:
- Using character classes instead of alternation where possible (e.g., \w+ instead of (GET|POST))
- Post-processing results to manually shift values
- Avoiding nested groups entirely, which limits regex expressiveness
Environment Information
OpenSearch Version: 3.3.0-SNAPSHOT
Additional Details:
- Plugin: SQL/PPL Plugin (Calcite-based engine)
- Tested via REST API:
POST /_plugins/_ppl
Root Cause Analysis (Preliminary)
Location: core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (lines 261-321)
Issue:
The visitRex method uses RegexCommonUtils.getNamedGroupCandidates() to extract named group names, then creates REX_EXTRACT calls with indices i + 1 (lines 284-302):
for (int i = 0; i < namedGroups.size(); i++) {
  extractCall = PPLFuncImpTable.INSTANCE.resolve(
      context.rexBuilder,
      BuiltinFunctionName.REX_EXTRACT,
      fieldRex,
      context.rexBuilder.makeLiteral(patternStr),
      context.relBuilder.literal(i + 1)); // ← Assumes only named groups exist
  newFields.add(extractCall);
  newFieldNames.add(namedGroups.get(i));
}
However, Java's regex engine counts all capture groups (both named and unnamed) when assigning group indices. For pattern (?<Method>(GET|POST)):
- Group 0: entire match
- Group 1: (?<Method>...) (named)
- Group 2: (GET|POST) (unnamed, nested)
The code assumes the n-th named group always has index n. That holds only when the pattern contains no unnamed groups; every unnamed group shifts the indices of the named groups that come after it.
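The numbering described above can be reproduced outside the plugin with plain java.util.regex. The following is a minimal sketch, not plugin code: the class name GroupNumberingDemo, its helper methods, and the shortened log line are made up for illustration, while the pattern is the one from the comparison test.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumberingDemo {
  // Pattern from the failing comparison test in this report.
  static final String PATTERN =
      ".+\"(?<Method>(GET|POST)) (?<Path>.+?) (?<Version>HTTP/[0-9\\.]+)";

  /** Returns the capture group at a positional index, or null if the text does not match. */
  static String byIndex(String text, int index) {
    Matcher m = Pattern.compile(PATTERN).matcher(text);
    return m.find() ? m.group(index) : null;
  }

  /** Returns the capture group with the given name, or null if the text does not match. */
  static String byName(String text, String name) {
    Matcher m = Pattern.compile(PATTERN).matcher(text);
    return m.find() ? m.group(name) : null;
  }

  public static void main(String[] args) {
    // Hypothetical, shortened log line used only for this demo.
    String line = "10.0.0.1 - - \"GET /a.deb HTTP/1.1\" 200 6219";
    for (int i = 1; i <= 4; i++) {
      System.out.println("group(" + i + ") = " + byIndex(line, i));
    }
    System.out.println("group(\"Path\") = " + byName(line, "Path"));
  }
}
```

Index 2 resolves to the nested (GET|POST) group rather than to Path, which matches the duplicated "GET" seen in the actual result, while lookup by name is unaffected by the extra group.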
Explain Plan Evidence:
LogicalProject(
RestMethod=[REX_EXTRACT($7, '.+"(?<RestMethod>(GET|POST)) ...', 1)],
HttpVersion=[REX_EXTRACT($7, '.+"(?<RestMethod>(GET|POST)) ...', 2)],
HttpStatus=[REX_EXTRACT($7, '.+"(?<RestMethod>(GET|POST)) ...', 3)],
...
)
The indices 1, 2, 3 don't correspond to the named groups when unnamed groups are present.
Proposed Fix (Preliminary)
Option 1: Use Named Group Extraction (Recommended)
Modify RexExtractFunction.extractGroup() to accept group names instead of indices:
public static String extractGroup(String text, String pattern, String groupName) {
  Pattern compiledPattern = Pattern.compile(pattern);
  Matcher matcher = compiledPattern.matcher(text);
  if (matcher.find()) {
    return matcher.group(groupName); // Use name instead of index
  }
  return null;
}
Update CalciteRelNodeVisitor.visitRex() to pass group names:
for (String groupName : namedGroups) {
  extractCall = PPLFuncImpTable.INSTANCE.resolve(
      context.rexBuilder,
      BuiltinFunctionName.REX_EXTRACT,
      fieldRex,
      context.rexBuilder.makeLiteral(patternStr),
      context.rexBuilder.makeLiteral(groupName)); // Pass name, not index
  newFields.add(extractCall);
  newFieldNames.add(groupName);
}
Option 2: Calculate Correct Group Indices
Parse the regex pattern to count all groups (named and unnamed) and map named groups to their actual indices. This is more complex and error-prone.
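To give a sense of what Option 2 would involve, here is a rough sketch of a left-to-right scan that assigns each named group its real capture index. The class NamedGroupIndexSketch and the method namedGroupIndices are hypothetical and do not exist in the plugin. The scan deliberately ignores \Q...\E quoting, comments mode, and a literal ] at the start of a character class, which illustrates why this route is the more fragile one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NamedGroupIndexSketch {

  /** Maps each named group to its capture index by counting every capturing "(" in order. */
  static Map<String, Integer> namedGroupIndices(String pattern) {
    Map<String, Integer> indices = new LinkedHashMap<>();
    int groupCount = 0; // capturing groups seen so far, named or unnamed
    int classDepth = 0; // > 0 while inside [...], where "(" is a literal
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      if (c == '\\') {
        i++; // skip the escaped character
      } else if (c == '[') {
        classDepth++;
      } else if (c == ']' && classDepth > 0) {
        classDepth--;
      } else if (c == '(' && classDepth == 0) {
        boolean special = i + 1 < pattern.length() && pattern.charAt(i + 1) == '?';
        if (!special) {
          groupCount++; // plain unnamed capturing group
        } else if (pattern.startsWith("?<", i + 1)
            && i + 3 < pattern.length()
            && pattern.charAt(i + 3) != '='
            && pattern.charAt(i + 3) != '!') {
          // "(?<name>" is a named capturing group; "(?<=" and "(?<!" are lookbehinds.
          groupCount++;
          int end = pattern.indexOf('>', i + 3);
          if (end > 0) {
            indices.put(pattern.substring(i + 3, end), groupCount);
          }
        }
        // Any other "(?..." form (non-capturing, lookahead, flags) captures nothing.
      }
    }
    return indices;
  }

  public static void main(String[] args) {
    String pattern = ".+\"(?<Method>(GET|POST)) (?<Path>.+?) (?<Version>HTTP/[0-9\\.]+)";
    System.out.println(namedGroupIndices(pattern));
  }
}
```

On JDK 20 or later, Pattern.namedGroups() exposes the same name-to-index mapping without hand-parsing; whether the plugin's minimum JDK permits relying on it is not stated in this report.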
Workaround
Until fixed, avoid nested unnamed capture groups:
# Instead of: (?<Method>(GET|POST))
# Use: (?<Method>GET|POST) or (?<Method>\w+)
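The workaround can be sanity-checked with plain java.util.regex by mimicking what the current index-based extraction does: read the n-th named field from positional group n. This is a sketch under assumptions: RexWorkaroundCheck, extractByPosition, and the shortened log line are invented for illustration, and the non-capturing form (?:GET|POST) is an additional variant that is not part of the original report and has not been tried against the plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RexWorkaroundCheck {

  /** Mimics index-based extraction: the n-th named field is read from group n (1-based). */
  static List<String> extractByPosition(String regex, String text, int namedGroupCount) {
    Matcher m = Pattern.compile(regex).matcher(text);
    List<String> values = new ArrayList<>();
    if (m.find()) {
      for (int i = 1; i <= namedGroupCount; i++) {
        values.add(m.group(i));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    // Hypothetical, shortened log line used only for this demo.
    String line = "10.0.0.1 - - \"GET /a.deb HTTP/1.1\" 200 6219";
    String tail = " (?<Path>.+?) (?<Version>HTTP/[0-9\\.]+)";

    // Nested unnamed group: positions 1..3 no longer line up with Method, Path, Version.
    System.out.println(extractByPosition(".+\"(?<Method>(GET|POST))" + tail, line, 3));
    // Workaround from this report: alternation directly inside the named group.
    System.out.println(extractByPosition(".+\"(?<Method>GET|POST)" + tail, line, 3));
    // Assumed extra variant: a non-capturing group adds no index.
    System.out.println(extractByPosition(".+\"(?<Method>(?:GET|POST))" + tail, line, 3));
  }
}
```

Both rewritten patterns keep the named groups at positions 1 through 3, so they should be unaffected by the index assumption until a fix lands.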
Related Files
- core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (lines 261-321)
- core/src/main/java/org/opensearch/sql/expression/function/udf/RexExtractFunction.java (lines 54-67)
- core/src/main/java/org/opensearch/sql/expression/parse/RegexCommonUtils.java (lines 60-68)
- integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteRexCommandIT.java (test coverage needed)