KAFKA-18475: Flaky PlaintextProducerSendTest testCloseWithZeroTimeoutFromCallerThread #20791
+26
−1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
The test
testCloseWithZeroTimeoutFromCallerThreadis flaky. The consumer may gets all of the messages after the producer is force closed, while futures in the producer are completed exceptionally.The bug comes from a race condition introduced by
RecordAccumulator#closeandRecordAccumulator#batchReady.RecordAccumulator#closesets the closed flag to true, andRecordAccumulator#batchReadythinks the batch is sendable. As a result those batches are sent in the sameSender#runOncecall becauserunOncedoesn't check theforceCloseflag.Test
An unit test is added to
SenderTest. It asserts that after a sender is force closed no message should be sent or polled.Change
It is hard to fully eliminate the bug:
Sender#forceClosecan happen at any point ofSender#runOncesince they run in different threads. The only way to ensure that "no action is permitted after force close" is to lockrunOnce, which is expensive.Adding a check on the flag before the poll in
runOncecan reduce the chance of the bug. Now the race condition only happens if sender is force closed during the poll. Notice that this eliminates the flaky test. In the test scenario, if poll happens during the poll, the client has nothing to operate in this round, and there is no next run.