Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Jan 6, 2026

Rationale for this change

When reading a CSV file encounters an early error, and the AsyncThreadedTableReader is only kept alive through the chain of futures and their callbacks, the ~AsyncThreadedTableReader destructor will be executed at the end of a TaskGroup task and try to wait for the TaskGroup itself for finish. This will obviously deadlock.

This issue was discovered by OSS-Fuzz in https://issues.oss-fuzz.com/issues/467451924

What changes are included in this PR?

Instead of waiting for the TaskGroup to finish in the AsyncThreadedTableReader, make sure that all async callbacks involved in CSV reading own their captured variables, to avoid use-after-free problems.

This has the side effect of keeping the AsyncThreadedTableReader alive until all relevant async callbacks have executed.

Are these changes tested?

Yes, by existing tests and a new fuzz regression test.

Are there any user-facing changes?

No.

@pitrou pitrou marked this pull request as ready for review January 6, 2026 16:38
@pitrou pitrou requested a review from zanmato1984 January 6, 2026 16:41
@pitrou pitrou force-pushed the gh48741-csv-async-deadlock branch from 932821c to cdf530b Compare January 6, 2026 17:36
@pitrou
Copy link
Member Author

pitrou commented Jan 6, 2026

@github-actions crossbow submit -g cpp

@github-actions

This comment was marked as outdated.

@zanmato1984
Copy link
Contributor

Why is the deadlock only happening when early error? Also, is it possible to construct a dedicated test case?

@pitrou
Copy link
Member Author

pitrou commented Jan 7, 2026

Why is the deadlock only happening when early error?

Because otherwise the task group is already finished (FinishAsync has finished) when the Future is marked successful.

Also, is it possible to construct a dedicated test case?

Hmm, that should hopefully be possible.

@zanmato1984
Copy link
Contributor

Why is the deadlock only happening when early error?

Because otherwise the task group is already finished (FinishAsync has finished) when the Future is marked successful.

I see, thanks.

Also, is it possible to construct a dedicated test case?

Hmm, that should hopefully be possible.

Yeah, let's have one please.

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix LGTM, only that a test might be needed.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 7, 2026
@pitrou pitrou force-pushed the gh48741-csv-async-deadlock branch 2 times, most recently from a8c352d to 12a8dea Compare January 7, 2026 16:46
@pitrou
Copy link
Member Author

pitrou commented Jan 7, 2026

@github-actions crossbow submit -g cpp

@pitrou
Copy link
Member Author

pitrou commented Jan 7, 2026

@zanmato1984 I added a test and also had to strengthen object ownership to avoid another regression, could you take another look?

@github-actions
Copy link

github-actions bot commented Jan 7, 2026

Revision: 12a8dea

Submitted crossbow builds: ursacomputing/crossbow @ actions-7019b48426

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-cuda-cpp-ubuntu-24.04-cuda-13.0.2 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-experimental-cpp-gcc-15 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

return Status::OK();
};

return VisitAsyncGenerator(std::move(block_generator), block_visitor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I have to double check the deadlock reasoning.

Seems that self is well captured by the returned future. So the deadlock happens only when: 1) early error happens and FinishAsync() is skipped; 2) the returned future is marked finished (by error) and released in its owning thread (the "main" thread), self ref count decreased but not to 0 (because of); 3) some task in the task group is still running, and the task holds a strong ref to self captured in block_visitor (so far the task group, tasks, and the self it owned are fully detached from the main thread?); 4) as the remaining tasks finish, self ref count goes to 0 and invokes the AsyncThreadedTableReader dtor, which in turn drains the task group in which the dtor is being invoked - deadlock?

If my understanding is correct, it doesn't seem that we have ownership issues:

  • If no error, then the main thread owns the future, which owns self, until all tasks are finished.
  • If early error, then the tasks own self.
  • self owns all the things: options, task group.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as the remaining tasks finish, self ref count goes to 0 and invokes the AsyncThreadedTableReader dtor, which in turn drains the task group in which the dtor is being invoked - deadlock?

Exactly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my understanding is correct, it doesn't seem that we have ownership issues:

  • If no error, then the main thread owns the future, which owns self, until all tasks are finished.
  • If early error, then the tasks own self.
  • self owns all the things: options, task group.

Please also see my this question :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what is the this question? I'm probably dense, but I can't find anywhere in the comments page :-S

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, don't know what's going on. Let me ask this: Given that my assumption of the cause of deadlock is correct, it doesn't seem to me that we have ownership issues:

  • If no error, then the main thread owns the future, which owns self, until all tasks are finished.
  • If early error, then the tasks own self.
  • In either case, self owns all the things: options, task group.

So why do we have to make all the ownership related changes in the rest of the PR?

Copy link
Contributor

@zanmato1984 zanmato1984 Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I may have an answer to my last question.

  • If early error, then the tasks own self.

self will be destructed in the last task, by when the task_group must be still alive. This explains the changes to the task_group_ ownership.

But why the changes to the options_ ownership? It doesn't have to out-live self, does it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why the changes to the options_ ownership? It doesn't have to out-live self, does it?

It does, because the ColumnBuilder can schedule its own tasks without owning the table reader:

void InferringColumnBuilder::ScheduleConvertChunk(int64_t chunk_index) {
task_group_->Append([this, chunk_index]() { return TryConvertChunk(chunk_index); });
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Thank you.

@pitrou pitrou force-pushed the gh48741-csv-async-deadlock branch from 12a8dea to 530942d Compare January 8, 2026 15:01
Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 with some nits.

protected:
// Make column builders from conversion schema
Status MakeColumnBuilders() {
// This is making a single copy of ConvertOptions, which is reasonable even if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is n shared_ptr VS. n copies of ConvertOptions? (Which makes sense.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

return Status::OK();
};

return VisitAsyncGenerator(std::move(block_generator), block_visitor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Thank you.

@pitrou pitrou force-pushed the gh48741-csv-async-deadlock branch from 530942d to 87316ce Compare January 8, 2026 15:37
@pitrou
Copy link
Member Author

pitrou commented Jan 8, 2026

I'm going to merge as CI is green. Thank you for the reviews @zanmato1984 !

@pitrou pitrou merged commit 4eccdad into apache:main Jan 8, 2026
54 of 56 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jan 8, 2026
@pitrou pitrou deleted the gh48741-csv-async-deadlock branch January 8, 2026 17:18
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 4eccdad.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants