Skip to content

Conversation

@1996fanrui
Copy link
Member

What is the purpose of the change

Problem

RecordsWindowBuffer.addElement() catches EOFException and retries recursively, causing StackOverflowError when retry keeps failing.

Root Cause

The original code assumes "flush will always free enough space for retry". This assumption fails when unrecoverable errors occur, leading to infinite recursion.

Solution

Use numKeys == 0 as the termination condition:

while (true) {
    LookupInfo<WindowKey, Iterator<RowData>> lookup = recordsBuffer.lookup(reuseWindowKey);
    try {
        recordsBuffer.append(lookup, recordSerializer.toBinaryRow(element));
        break;
    } catch (EOFException e) {
        if (recordsBuffer.getNumKeys() == 0) {
            // Buffer is empty, retry won't help (unrecoverable error)
            throw e;
        }
        // Buffer has data, flush and retry
        flush();
        checkState(recordsBuffer.getNumKeys() == 0, "The recordsBuffer should be empty after flushing.");
    }
}

Why This Works

Scenario numKeys before catch Behavior
Buffer full (recoverable) > 0 flush → numKeys=0 → retry → success
Unrecoverable error (1st attempt) 0 throw immediately
Unrecoverable error (after flush retry) 0 flush → numKeys=0 → retry → fail again → numKeys still 0 → throw

Key point: flush() clears the buffer, making numKeys == 0. If retry still fails after flush, entering the catch block again with numKeys == 0 indicates the problem is not caused by a full buffer, but an unrecoverable error. In this case, an exception should be thrown instead of continuing to retry.

Benefits

  1. Prevents StackOverflowError: The while loop executes at most twice (initial attempt + retry after flush), no infinite loop
  2. Preserves normal behavior: Normal flush + retry still works when buffer is full
  3. Better diagnostics: Unrecoverable errors throw a complete EOFException with full stack trace, making it easy to identify the root cause
  4. Minimal change: Only checks numKeys, no additional state variables introduced

Brief change log

  • [hotfix] WindowBuffer implements AutoCloseable to avoid declare close explicitly
  • [FLINK-38746][table-runtime] Fix StackOverflowError in RecordsWindowBuffer.addElement

Verifying this change

  • Added RecordsWindowBufferTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 27, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@1996fanrui
Copy link
Member Author

@flinkbot run azure

@1996fanrui 1996fanrui force-pushed the 38746/Fix-StackOverflowError-in-RecordsWindowBuffer branch from 012be30 to e53fbfb Compare November 28, 2025 10:05
flush();
// remember to add the input element again
addElement(key, sliceEnd, element);
while (true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to move the above line
minSliceEnd = Math.min(sliceEnd, minSliceEnd);
inside the loop?

In flush(), it might be reset to Long.MAX_VALUE, so we'll end up with the wrong minSliceEnd field

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

I have moved it between append and break, and introduced a new test to cover this case.

Comment on lines +94 to +96
if (recordsBuffer.getNumKeys() == 0) {
// Buffer is empty, retry won't help (record is too large for the buffer)
throw e;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is getNumKeys correct and sufficient?
Or should getNumElements be used instead or in addition?

In append(), I see that numKeys is not incremented if lookupInfo.found.
I guess there's an implicit invariant that numKeys > 0 in this case, but I'm not sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are equivalent.

  • One element must have one key
  • Every key has at least one element
  • numKeys == 0numElements == 0 ⟺ buffer is empty

@1996fanrui 1996fanrui force-pushed the 38746/Fix-StackOverflowError-in-RecordsWindowBuffer branch from e53fbfb to a726361 Compare December 3, 2025 15:31
Copy link
Member Author

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rkhachatryan for the review, all your comments are addressed.

Comment on lines +94 to +96
if (recordsBuffer.getNumKeys() == 0) {
// Buffer is empty, retry won't help (record is too large for the buffer)
throw e;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are equivalent.

  • One element must have one key
  • Every key has at least one element
  • numKeys == 0numElements == 0 ⟺ buffer is empty

flush();
// remember to add the input element again
addElement(key, sliceEnd, element);
while (true) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

I have moved it between append and break, and introduced a new test to cover this case.

@1996fanrui 1996fanrui merged commit 644f461 into apache:master Dec 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants