[BUG] rex command off-by-one error with nested capture groups in named groups #4466

@alexey-temnikov

Query Information

PPL Command/Query:

source=opensearch_dashboards_sample_data_logs
| rex field=message '.+"(?<RestMethod>(GET|POST)) .+?(?<HttpVersion>HTTP/[0-9\.]+)" (?<HttpStatus>\d+) (?<BytesSent>.+?) .+'
| fields RestMethod, HttpVersion, HttpStatus, BytesSent, message
| head 1

Expected Result:

{
  "datarows": [
    [
      "GET",
      "HTTP/1.1",
      "200",
      "6219",
      "223.87.60.27 - - [2018-07-22T00:39:02.912Z] \"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\""
    ]
  ]
}

Actual Result:

{
  "datarows": [
    [
      "GET",
      "GET",
      "HTTP/1.1",
      "200",
      "223.87.60.27 - - [2018-07-22T00:39:02.912Z] \"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\""
    ]
  ]
}

The values are shifted by one position:

  • RestMethod column shows "GET" (correct value, but appears twice)
  • HttpVersion column shows "GET" (should be "HTTP/1.1")
  • HttpStatus column shows "HTTP/1.1" (should be "200")
  • BytesSent column shows "200" (should be "6219")

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (OpenSearch Dashboards sample data)

Index Mapping

{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "timestamp": {
        "type": "date"
      },
      "response": {
        "type": "text"
      },
      "request": {
        "type": "text"
      }
    }
  }
}

Sample Data

{
  "message": "223.87.60.27 - - [2018-07-22T00:39:02.912Z] \"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\"",
  "timestamp": "2025-09-28T00:39:02.912Z",
  "response": 200,
  "request": "/opensearch/opensearch-1.0.0.deb",
  "bytes": 6219,
  "clientip": "223.87.60.27"
}

Bug Description

Issue Summary:
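The rex command returns values under the wrong field names when a pattern mixes named capture groups with unnamed capture groups, for example an alternation nested inside a named group. Each named group that follows an unnamed group receives the value of an earlier group, and the shift grows by one position for every unnamed group that precedes it.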

Steps to Reproduce:

  1. Use the sample data index opensearch_dashboards_sample_data_logs
  2. Execute a rex command with named capture groups containing nested unnamed groups:
    source=opensearch_dashboards_sample_data_logs
    | rex field=message '.+"(?<RestMethod>(GET|POST)) .+?(?<HttpVersion>HTTP/[0-9\.]+)" (?<HttpStatus>\d+) (?<BytesSent>.+?) .+'
    | fields RestMethod, HttpVersion, HttpStatus, BytesSent
    | head 1
    
  3. Observe that values are shifted: the RestMethod value appears in both the RestMethod and HttpVersion columns, the HttpVersion value appears in the HttpStatus column, and the HttpStatus value appears in the BytesSent column.

Comparison Test:

  • Without nested groups (works correctly):

    source=opensearch_dashboards_sample_data_logs
    | rex field=message '.+"(?<Method>\w+) (?<Path>.+?) (?<Version>HTTP/[0-9\.]+)'
    | fields Method, Path, Version
    | head 1
    

    Result: ✅ Correct - "GET", "/opensearch/opensearch-1.0.0.deb_1", "HTTP/1.1"

  • With nested groups (fails):

    source=opensearch_dashboards_sample_data_logs
    | rex field=message '.+"(?<Method>(GET|POST)) (?<Path>.+?) (?<Version>HTTP/[0-9\.]+)'
    | fields Method, Path, Version
    | head 1
    

    Result: ❌ Incorrect - "GET", "GET", "/opensearch/opensearch-1.0.0.deb_1"

Impact:
This bug makes it impossible to wrap an alternation in its own parentheses (e.g., (GET|POST), (error|warning)) inside a named capture group, or to place any other unnamed capture group ahead of a named one. Both are common in log parsing. Users must work around this by:

  • Using character classes instead of alternation where possible (e.g., \w+ instead of (GET|POST))
  • Post-processing results to manually shift values
  • Avoiding nested groups entirely, which limits regex expressiveness

Environment Information

OpenSearch Version: 3.3.0-SNAPSHOT

Additional Details:

  • Plugin: SQL/PPL Plugin (Calcite-based engine)
  • Tested via REST API: POST /_plugins/_ppl

Root Cause Analysis (Preliminary)

Location: core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (lines 261-321)

Issue:
The visitRex method uses RegexCommonUtils.getNamedGroupCandidates() to extract named group names, then creates REX_EXTRACT calls with indices i + 1 (lines 284-302):

for (int i = 0; i < namedGroups.size(); i++) {
  extractCall = PPLFuncImpTable.INSTANCE.resolve(
      context.rexBuilder,
      BuiltinFunctionName.REX_EXTRACT,
      fieldRex,
      context.rexBuilder.makeLiteral(patternStr),
      context.relBuilder.literal(i + 1));  // ← Assumes only named groups exist
  newFields.add(extractCall);
  newFieldNames.add(namedGroups.get(i));
}

However, Java's regex engine counts all capture groups (both named and unnamed) when assigning group indices. For pattern (?<Method>(GET|POST)):

  • Group 0: entire match
  • Group 1: (?<Method>...) (named)
  • Group 2: (GET|POST) (unnamed, nested)

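The code assumes that the i-th named group always has index i + 1. That holds only when no unnamed capture group opens before it; each unnamed group that does shifts the real index of every later named group by one.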

Explain Plan Evidence:

LogicalProject(
  RestMethod=[REX_EXTRACT($7, '.+"(?<RestMethod>(GET|POST)) ...', 1)],
  HttpVersion=[REX_EXTRACT($7, '.+"(?<RestMethod>(GET|POST)) ...', 2)],
  HttpStatus=[REX_EXTRACT($7, '.+"(?<RestMethod>(GET|POST)) ...', 3)],
  ...
)

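With the unnamed (GET|POST) group occupying index 2, the named groups actually sit at indices 1 (RestMethod), 3 (HttpVersion), 4 (HttpStatus), and 5 (BytesSent). The planned indices 2 and 3 therefore return "GET" for HttpVersion and "HTTP/1.1" for HttpStatus.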
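
The shift can be reproduced outside OpenSearch with plain java.util.regex. The following is a minimal sketch, not code from the plugin: the class RexShiftDemo and its methods are hypothetical names, and extractByPosition only imitates the i + 1 indexing described in this analysis, using the pattern and sample message from this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration; RexShiftDemo is a hypothetical name, not a plugin class.
final class RexShiftDemo {
  // Pattern from the issue, escaped for a Java string literal.
  static final String PATTERN =
      ".+\"(?<RestMethod>(GET|POST)) .+?(?<HttpVersion>HTTP/[0-9\\.]+)\" "
          + "(?<HttpStatus>\\d+) (?<BytesSent>.+?) .+";

  // Sample message from the issue.
  static final String MESSAGE =
      "223.87.60.27 - - [2018-07-22T00:39:02.912Z] "
          + "\"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" "
          + "\"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\"";

  static final List<String> NAMES =
      List.of("RestMethod", "HttpVersion", "HttpStatus", "BytesSent");

  // Imitates the reported behavior: the i-th named group is read from index i + 1.
  static List<String> extractByPosition() {
    Matcher m = Pattern.compile(PATTERN).matcher(MESSAGE);
    List<String> out = new ArrayList<>();
    if (m.find()) {
      for (int i = 0; i < NAMES.size(); i++) {
        out.add(m.group(i + 1));
      }
    }
    return out;
  }

  // Reads every group by name, so unnamed groups cannot shift the result.
  static List<String> extractByName() {
    Matcher m = Pattern.compile(PATTERN).matcher(MESSAGE);
    List<String> out = new ArrayList<>();
    if (m.find()) {
      for (String name : NAMES) {
        out.add(m.group(name));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println("by position: " + extractByPosition());
    System.out.println("by name:     " + extractByName());
  }
}
```

The positional lookup yields [GET, GET, HTTP/1.1, 200], which matches the Actual Result reported here, while the named lookup yields [GET, HTTP/1.1, 200, 6219], which matches the Expected Result.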

Proposed Fix (Preliminary)

Option 1: Use Named Group Extraction (Recommended)
Modify RexExtractFunction.extractGroup() to accept group names instead of indices:

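// Requires java.util.regex.Matcher and java.util.regex.Pattern.
public static String extractGroup(String text, String pattern, String groupName) {
  Pattern compiledPattern = Pattern.compile(pattern);
  Matcher matcher = compiledPattern.matcher(text);
  if (matcher.find()) {
    // Look the group up by name so that unnamed groups cannot shift the result.
    // Matcher.group(String) throws IllegalArgumentException if no group has this name.
    return matcher.group(groupName);
  }
  return null;  // No match in the input text
}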

Update CalciteRelNodeVisitor.visitRex() to pass group names:

for (String groupName : namedGroups) {
  extractCall = PPLFuncImpTable.INSTANCE.resolve(
      context.rexBuilder,
      BuiltinFunctionName.REX_EXTRACT,
      fieldRex,
      context.rexBuilder.makeLiteral(patternStr),
      context.rexBuilder.makeLiteral(groupName));  // Pass name, not index
  newFields.add(extractCall);
  newFieldNames.add(groupName);
}

Option 2: Calculate Correct Group Indices
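Parse the regex pattern to count all capture groups (named and unnamed) and map each named group to its actual index. This is more complex and error-prone, because the parser has to recognize escapes, character classes, and constructs that look like groups but do not capture, such as (?:...) and the lookbehinds (?<=...) and (?<!...).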
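
To give a sense of what Option 2 involves, here is a rough sketch of such a scanner. It is an illustration only: GroupIndexSketch and namedGroupIndices are hypothetical names, not existing plugin code, and the scanner deliberately ignores some edge cases (a literal ] placed first in a character class, and # comments under the (?x) flag).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; a sketch of Option 2, not an existing plugin class.
final class GroupIndexSketch {

  // Maps each named group to the index the Java regex engine assigns to it,
  // by counting every capturing "(" from left to right.
  static Map<String, Integer> namedGroupIndices(String regex) {
    Map<String, Integer> indices = new LinkedHashMap<>();
    int groupIndex = 0;   // index of the most recently opened capturing group
    int classDepth = 0;   // > 0 while inside [...], where "(" is a literal
    for (int i = 0; i < regex.length(); i++) {
      char c = regex.charAt(i);
      if (c == '\\') {
        if (i + 1 < regex.length() && regex.charAt(i + 1) == 'Q') {
          // \Q...\E quotes a literal run; jump to its end.
          int end = regex.indexOf("\\E", i + 2);
          i = (end < 0) ? regex.length() : end + 1;
        } else {
          i++; // skip the escaped character, e.g. "\("
        }
      } else if (c == '[') {
        classDepth++;
      } else if (c == ']' && classDepth > 0) {
        classDepth--;
      } else if (c == '(' && classDepth == 0) {
        boolean special = i + 1 < regex.length() && regex.charAt(i + 1) == '?';
        if (!special) {
          groupIndex++; // plain unnamed capturing group
        } else if (regex.startsWith("?<", i + 1)
            && i + 3 < regex.length()
            && Character.isLetter(regex.charAt(i + 3))) {
          // Named group: "(?<" followed by a letter. Lookbehinds "(?<=" and
          // "(?<!" fail the letter check and are not counted.
          int close = regex.indexOf('>', i + 3);
          if (close > 0) {
            indices.put(regex.substring(i + 3, close), ++groupIndex);
          }
        }
        // Any other "(?" construct, such as "(?:", does not capture.
      }
    }
    return indices;
  }

  public static void main(String[] args) {
    String pattern =
        ".+\"(?<RestMethod>(GET|POST)) .+?(?<HttpVersion>HTTP/[0-9\\.]+)\" "
            + "(?<HttpStatus>\\d+) (?<BytesSent>.+?) .+";
    System.out.println(namedGroupIndices(pattern));
  }
}
```

For the pattern in this issue the sketch maps RestMethod to 1, HttpVersion to 3, HttpStatus to 4, and BytesSent to 5. The number of special cases it already needs, and the ones it still misses, is the main argument for preferring Option 1.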

Workaround

Until fixed, avoid nested unnamed capture groups:

# Instead of: (?<Method>(GET|POST))
# Use: (?<Method>GET|POST)  or  (?<Method>\w+)
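
A quick way to check whether a rewritten pattern is safe is to ask the Java regex engine how many capturing groups it sees and compare that with the number of named groups. The sketch below does this with Matcher.groupCount(); WorkaroundCheck and captureGroupCount are hypothetical names. The non-capturing form (?<Method>(?:GET|POST)) is an extra candidate that is not mentioned in this issue; it is included on the assumption that rex hands the pattern to java.util.regex unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking patterns locally; not part of the plugin.
final class WorkaroundCheck {

  // Number of capturing groups (named or unnamed) the Java engine sees.
  static int captureGroupCount(String regex) {
    return Pattern.compile(regex).matcher("").groupCount();
  }

  public static void main(String[] args) {
    String[] candidates = {
      "(?<Method>(GET|POST))",   // affected: named group plus nested unnamed group
      "(?<Method>GET|POST)",     // workaround: alternation directly in the named group
      "(?<Method>\\w+)",         // workaround: character class instead of alternation
      "(?<Method>(?:GET|POST))"  // assumption: non-capturing group, not from the issue
    };
    for (String regex : candidates) {
      System.out.println(captureGroupCount(regex) + " capturing group(s): " + regex);
    }
  }
}
```

Each of these patterns has exactly one named group, so a count of 1 means that positional and named lookups agree. Only the first candidate reports 2.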

Related Files

  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (lines 261-321)
  • core/src/main/java/org/opensearch/sql/expression/function/udf/RexExtractFunction.java (lines 54-67)
  • core/src/main/java/org/opensearch/sql/expression/parse/RegexCommonUtils.java (lines 60-68)
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteRexCommandIT.java (test coverage needed)

Metadata

Labels

PPL (Piped processing language), bug (Something isn't working)

Projects

Status

In progress
