Use CUDA streams in all pylibcudf calls made by cudf-polars #20291
Conversation
I haven't finished flagging the non-trivial changes, but I have to run. I'll finish that up later.
This adds CUDA streams to all pylibcudf calls in cudf-polars. At the moment, we continue to use the default stream for all operations, so we're *explicitly* using the default stream. A future PR will update things to use non-default streams.
This adds CUDA streams to `cudf_polars.dsl.expressions.aggregation`. Streams are still missing from some `cudf_polars.containers.Column` calls in this file, but all the direct pylibcudf calls should be covered. Split off rapidsai#20291.
The remaining non-trivial changes.
- ast_result = to_ast(predicate)
+ stream = get_cuda_stream()
+ ast_result = to_ast(predicate, stream=stream)
+ stream.synchronize()
Wait...Why are we synchronizing here?
See the discussion at #20291 (comment). In these spots, we don't have natural access to a CUDA stream like we do in `IR.do_evaluate`. But we do need to ensure that the output is valid by the time we use the result. So our two options were:
- Have some sort of well-known stream ("a stream singleton" I called it) that can do these things and be synchronized with before using the result
- Ensure the result is valid before returning control flow to the rest of the program (i.e., call `stream.synchronize()`)

All the cases in the thread linked above are quick things that just initialize some small pieces of data, so calling `stream.synchronize()` here should be fine.
I think another option is to do it lazily:

    def to_ast_lazy(predicate):
        def f(stream):
            return to_ast(predicate, stream=stream)
        return f

And then call it when we have the stream: `to_ast_lazy(predicate)(df.stream)`
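For concreteness, a sketch of how the deferred callable might be consumed later, once a stream is available. The `df.stream` attribute and the `do_evaluate` shape are assumptions from this thread, not the actual cudf-polars code:

```python
# Sketch only: `to_ast`, `predicate`, and `df.stream` are assumed from the
# discussion above; this is not the actual cudf-polars implementation.
predicate_fn = to_ast_lazy(predicate)  # cheap, host-only; no device work yet

def do_evaluate(df):
    # The device work now happens on the DataFrame's own stream, so no
    # extra synchronization is needed before the result is consumed there.
    ast_result = predicate_fn(df.stream)
    ...
```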
Yep, that seems like it'd work for our (current) usage of `ConditionalJoin.Predicate.ast`, which is in a `do_evaluate`. I'm less certain whether it'd work for `duration_to_scalar` and the stats things.
Given how this is used (just making a `plc.Scalar`) I think a synchronize is OK.
> I'm less certain whether it'd work for `duration_to_scalar` and the stats things.
FWIW I think it should work.
> Given how this is used (just making a `plc.Scalar`) I think a synchronize is OK.
I think we should try to avoid stream syncs when we can. Maybe add a TODO comment saying we should consider implementing the lazy approach or the refactor @wence- mentioned in #20291 (comment)?
Handling stats is proving more complicated. IIUC, we use those results to make lowering decisions, so it really does need to be done before we're in some `IR.do_evaluate`. So synchronization is required / isn't an (additional) problem.

Could CUDA events help a little here? We record the event and hand it along with the stream as a "future." And then only when we need the host result do we sync. Idk though, I'm less confident about this one, but a TODO comment probably doesn't hurt? It would have to be a TODO, I think, because IIRC you mentioned that CUDA events aren't supported in RMM yet.
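To illustrate the idea, here's a minimal sketch of the event-as-future pattern, written against CuPy's event API since (as noted) RMM doesn't expose events yet. `compute_stats_async` and the reduction are made up for the example:

```python
import cupy as cp

def compute_stats_async(col: cp.ndarray, stream: cp.cuda.Stream):
    """Launch a reduction on `stream` and return (result, event) as a 'future'."""
    with stream:
        result = col.sum()  # enqueued on `stream`; result is a 0-d device array
    event = cp.cuda.Event(block=False, disable_timing=True)
    event.record(stream)  # marks the point at which `result` becomes valid
    return result, event

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    col = cp.arange(1_000)
result, event = compute_stats_async(col, stream)

# ... other work, possibly on other streams ...

event.synchronize()  # block only when the host value is actually needed
host_value = int(result.get())
```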
As you say, lack of an Event API is one thing preventing us from doing this. The `join_streams` implementation uses events to record an event on one stream and (asynchronously) wait for it on another.

But then we run into the fact that at the time we need to join the two streams (the stream the stats collection happened on and the stream we want to use the result on), we don't have access to the original stream anymore. My earlier "stream singleton" thing did this, but ended up being ugly. And I think it ended up not being worthwhile since IIUC we need to synchronize anyway in order to make decisions about how to lower the IR nodes.
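Roughly, the event-based join described here looks like this, sketched with CuPy since RMM lacks an Event API (`producer`/`consumer` are illustrative names):

```python
import cupy as cp

def join_streams(producer: cp.cuda.Stream, consumer: cp.cuda.Stream) -> None:
    """Make work later enqueued on `consumer` wait for work already enqueued
    on `producer`, without blocking the host."""
    event = cp.cuda.Event(block=False, disable_timing=True)
    event.record(producer)      # capture the producer stream's current position
    consumer.wait_event(event)  # on-device wait; the host thread keeps going
```

The catch, as described above, is that we no longer hold the producer stream by the time we'd need to do this join.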
> And I think it ended up not being worthwhile since IIUC we need to synchronize anyway in order to make decisions about how to lower the IR nodes.
Ok thanks, thought I'd just mention it. That makes sense.
The `to_ast` change is causing problems, since we used it to validate whether something was supported and to raise a `NotImplementedError`. By deferring `to_ast` until we're inside `do_evaluate`, we've passed the point where we can do the usual fallback.

For now, I've reverted the `to_ast` change (so we're back to a synchronize). But I'll see if there's an easy way to do it for this PR. We might need to split out the "validate whether you can do this" part from the "actually do stuff that might require a stream" part.
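A hypothetical shape for that split (all names illustrative; `_is_supported` is a stand-in for whatever host-side check translation can do without a stream):

```python
def validate_predicate(predicate) -> None:
    """Host-only check at translation time, so NotImplementedError can still
    trigger the usual CPU fallback. No stream (or GPU work) involved."""
    if not _is_supported(predicate):  # hypothetical host-side predicate check
        raise NotImplementedError(f"Conditional join with predicate {predicate}")

def build_predicate_ast(predicate, stream):
    """Stream-dependent part: allocate device scalars / build the AST, called
    later from do_evaluate where a stream is naturally available."""
    return to_ast(predicate, stream=stream)
```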
I gave rewriting how we use `to_ast` one more shot and failed. More details at #20372.

For now, I think we just live with this temporary stream & synchronization. I'll look into it more this week.
Move `to_ast` call to a `do_evaluate` context.
This updates the callers of `offsets_to_windows` to wait until they're in a context with a stream available.
This reverts commit 4bce714.
This defers the point at which we call `offsets_to_windows` until we have a CUDA stream in `do_evaluate`. We still perform some of the work (calling `duration_to_int`) early on to catch any translation issues (e.g., we want to raise when month is non-zero, since we don't support that).
Thanks for helping me understand streams better, this is getting close
raise NotImplementedError(
    f"Conditional join with predicate {self.predicate}"
The upstream polars tests are failing because we're raising in `do_evaluate` now.
Can we separate concerns here? That is, split up "translating the predicate" from "allocating scalars on a stream"?
This reverts commit 617f450.
/merge
#20291 missed a spot in `Join` where we need to pass the CUDA stream to the pylibcudf join function. This shows up in PDSH query 4.

Authors:
- Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
- Matthew Murray (https://github.com/Matt711)

URL: #20398
Description
This adds CUDA streams to all pylibcudf calls in cudf-polars.
At the moment, we continue to use the default stream for all operations, so we're explicitly using the default stream. A future PR will update things to use non-default streams.
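For example, a minimal sketch of the pattern applied here, assuming your pylibcudf version accepts a `stream` keyword on these APIs (the `unary_operation` call is just an illustrative call site, not one from this PR):

```python
import pylibcudf as plc
from rmm.pylibrmm.stream import DEFAULT_STREAM

def evaluate(column: plc.Column) -> plc.Column:
    # Explicitly thread the (for now, default) stream through the call, so
    # switching to a non-default stream later is a local change per call site.
    stream = DEFAULT_STREAM
    return plc.unary.unary_operation(
        column, plc.unary.UnaryOperator.ABS, stream=stream
    )
```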
As far as I can tell, this should get all the pylibcudf calls in cudf-polars. It's a lot of code to review. Unfortunately, it mixes many trivial changes (adding `stream=stream` in a bunch of spots) with a handful of non-trivial changes. I'll comment inline on all the non-trivial changes. I'm more than happy to break those changes out into their own PR (but it gets complicated: the changes to `Column.nan_count`, for example, force the changes to `broadcast` and `aggregation.py`...).

Closes #20239
Part of #20228