feat: Add optimization to randomize null key in semi join #26251

feilong-liu · 2025-10-07T23:23:27Z

Description

This PR is to add an optimization to deal with NULL skew on the source side/left side in SemiJoin.
The motivating query is of pattern like:

SELECT
    *
FROM lineitem l
WHERE
    l.partkey IN (
        SELECT
            partkey
        FROM partsupp
    )
    OR l.orderkey IN (
        SELECT
            orderkey
        FROM orders
    );

Here l.partkey or l.orderkey has a significant portion of values to be NULL. For this case, we cannot simply filtering rows which have NULL in semi join key, as for rows with l.partkey to be NULL, l.orderkey may not be NULL and return true for the semi join. In order to mitigate the NULL skew, we apply a similar optimization as in https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/RandomizeNullKeyInOuterJoin.java where we coalesce the NULL keys to a new key on the left/source side.

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we apply one more projection at the end semi_output := new_semi_output OR (source_key IS NULL ? NULL : FALSE). Here when source_key is NULL, we will have new_semi_output to be false, and semi_output back to NULL. While when source_key is not NULL, new_semi_output OR FALSE is still new_semi_output which does not change the result.

Motivation and Context

As described above.

Impact

For the motivating production queries, we observe a latency drop from 3 hours to only 4 minutes.

Test Plan

Unit tests, also run verifier suites with production queries suite1

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a new optimizer which do null skew mitigation for applicable semi joins

sourcery-ai · 2025-10-07T23:23:33Z

Reviewer's Guide

Introduces a new optimizer to randomize null join keys in semi joins, driven by a configurable strategy, and integrates it into the planning pipeline with accompanying configuration and tests.

Class diagram for new and updated join key randomization types

classDiagram
    class FeaturesConfig {
        -boolean nativeEnforceJoinBuildInputPartition
        -boolean randomizeOuterJoinNullKey
        -RandomizeOuterJoinNullKeyStrategy randomizeOuterJoinNullKeyStrategy
        -RandomizeSemiJoinNullKeyStrategy randomizeSemiJoinNullKeyStrategy
        -ShardedJoinStrategy shardedJoinStrategy
        -int joinShardCount
        -boolean isOptimizeConditionalAggregationEnabled
        +RandomizeSemiJoinNullKeyStrategy getRandomizeSemiJoinNullKeyStrategy()
        +FeaturesConfig setRandomizeSemiJoinNullKeyStrategy(RandomizeSemiJoinNullKeyStrategy)
    }
    class RandomizeOuterJoinNullKeyStrategy {
        <<enum>>
        DISABLED
        ALWAYS
    }
    class RandomizeSemiJoinNullKeyStrategy {
        <<enum>>
        DISABLED
        ALWAYS
    }
    FeaturesConfig --> RandomizeOuterJoinNullKeyStrategy
    FeaturesConfig --> RandomizeSemiJoinNullKeyStrategy
    class SystemSessionProperties {
        +RandomizeSemiJoinNullKeyStrategy getRandomizeSemiJoinNullKeyStrategy(Session)
    }
    SystemSessionProperties --> RandomizeSemiJoinNullKeyStrategy

Class diagram for the new RandomizeNullKeyInSemiJoin optimizer

classDiagram
    class RandomizeNullKeyInSemiJoin {
        -FunctionAndTypeManager functionAndTypeManager
        -boolean isEnabledForTesting
        +RandomizeNullKeyInSemiJoin(FunctionAndTypeManager)
        +void setEnabledForTesting(boolean)
        +boolean isEnabled(Session)
        +PlanOptimizerResult optimize(PlanNode, Session, TypeProvider, VariableAllocator, PlanNodeIdAllocator, WarningCollector)
    }
    class SemiJoinUsageCollector {
        +void collect(PlanNode)
        +Set<SemiJoinNode> getEligibleSemiJoins()
    }
    class Rewriter {
        -Session session
        -FunctionAndTypeManager functionAndTypeManager
        -PlanNodeIdAllocator planNodeIdAllocator
        -VariableAllocator planVariableAllocator
        -boolean planChanged
        +boolean isPlanChanged()
        +PlanNode visitSemiJoin(SemiJoinNode, RewriteContext<Set<SemiJoinNode>>)
    }
    RandomizeNullKeyInSemiJoin --> SemiJoinUsageCollector
    RandomizeNullKeyInSemiJoin --> Rewriter

File-Level Changes

Change	Details	Files
Define and expose new semi join null-key randomization strategy in feature configuration	Add RandomizeSemiJoinNullKeyStrategy enum (DISABLED, ALWAYS) Introduce randomizeSemiJoinNullKeyStrategy field with default DISABLED Provide getter, @config setter, and description	`FeaturesConfig.java`
Add session property for semi join null-key randomization	Define RANDOMIZE_SEMI_JOIN_NULL_KEY_STRATEGY constant Register PropertyMetadata with allowed enum values Implement getRandomizeSemiJoinNullKeyStrategy()	`SystemSessionProperties.java`
Hook up the semi join randomization optimizer in the planner	Insert RandomizeNullKeyInSemiJoin into the optimizer sequence Preserve existing outer join randomization placement	`PlanOptimizers.java`
Implement RandomizeNullKeyInSemiJoin optimization	Traverse plan to identify eligible semi joins based on output usage Inject randomized key expressions via Project nodes Reuse random keys for multiple joins on the same variable Support enabling via session strategy or testing flag	`RandomizeNullKeyInSemiJoin.java`
Add and update tests for semi join null-key randomization	Create comprehensive TestRandomizeNullKeyInSemiJoin covering eligibility and edge cases Extend TestFeaturesConfig to include mapping and defaults for new strategy	`TestRandomizeNullKeyInSemiJoin.java` `TestFeaturesConfig.java`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/optimizations/TestRandomizeNullKeyInSemiJoin.java:42-51` </location>
<code_context>
+                false);
+    }
+
+    @Test
+    public void testSemiJoinWithUnsupportedJoinKeyType()
+    {
+        // Semi join with varchar keys - should NOT be optimized (only INTEGER, BIGINT, DATE are supported)
</code_context>

<issue_to_address>
**suggestion (testing):** Missing test for DATE join key type in semi join randomization.

Please add a test case for semi join randomization with a DATE join key to verify correct optimizer behavior.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

kaikalur · 2025-10-09T19:31:06Z

.../src/main/java/com/facebook/presto/sql/planner/optimizations/RandomizeNullKeyInSemiJoin.java

+    }
+
+    private static class Rewriter
+            extends SimplePlanRewriter<Set<SemiJoinNode>>


could we move some of the randomization logic to planner utils so it's common across the two optimizers?

Moved the function to generate random join key into a separate util files.
For other parts of the rewriter, the code in the outer join optimizer is more complex, as it contains an additional option which tracks if the input key is from an outer join or not, and cost based trigger with HBO. Also outer joins can have multiple join keys and we do randomization for all join keys. While for semi join, we only have one join key. These will make extracting the rest of logic a bit complex. Consider this code is relatively small and we see it will only be used in these two cases, I would prefer to leave the rest of code as it is now.

kaikalur · 2025-10-09T20:18:18Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add

randomize(x) in (select randomize(x) from t)
AND if(x is null, null, true)

to preserve null semantics instead of looking for context?

feilong-liu · 2025-10-10T06:11:23Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add

randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)

to preserve null semantics instead of looking for context?

When x is NULL, x in (select x from t) will be NULL, while randomize(x) in (select randomize(x) from t) AND if(x is null, null, true) will be false AND NULL which will still be false.

And we may do some special handling for NULL keys on the left side, but we will still face problem for cases where the right side have NULLs

kaikalur · 2025-10-10T15:08:19Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add

randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)

to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

kaikalur · 2025-10-10T15:09:13Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

When x is NULL, x in (select x from t) will be NULL, while randomize(x) in (select randomize(x) from t) AND if(x is null, null, true) will be false AND NULL which will still be false.

And we may do some special handling for NULL keys on the left side, but we will still face problem for cases where the right side have NULLs

Yeah - see my comment - don't salt/randomize right side!

feilong-liu · 2025-10-11T04:47:29Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false))
And for right side, we can randomize only when the semi join result usage follows the constraint mentioned in the description. Actually under this constraint, we can directly filter out the null keys instead, which is easier and more efficient.

Remember we do distinct for right side

I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation?

Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful.

kaikalur · 2025-10-11T13:35:32Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false)) And for right side, we can randomize only when the semi join result usage follows the constraint mentioned in the description. Actually under this constraint, we can directly filter out the null keys instead, which is easier and more efficient.

Remember we do distinct for right side

I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation?

Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful.

I say just don't salt right side. It's extremely rare and also we should distinct that will remove skew. and no special context checks required at all with that.

feilong-liu · 2025-10-13T03:37:34Z

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false)) And for right side, we can randomize only when the semi join result usage follows the constraint mentioned in the description. Actually under this constraint, we can directly filter out the null keys instead, which is easier and more efficient.

Remember we do distinct for right side

I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation?
Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful.

I say just don't salt right side. It's extremely rare and also we should distinct that will remove skew. and no special context checks required at all with that.

Updated the code to do salt only on the left side.

rschlussel · 2025-10-13T19:33:32Z

...c/main/java/com/facebook/presto/sql/planner/iterative/rule/RandomizeSourceKeyInSemiJoin.java

+        this.functionAndTypeManager = requireNonNull(functionAndTypeManager, "functionAndTypeManager is null");
+    }
+
+    private static boolean isSupportedType(VariableReferenceExpression variable)


why just these types? Seems like anything other than like char/varchar/varbinary where you might actually be generating real keys would be fine.

For float/double, I am not sure how cast to varchar works for it, so I didn't include them. For other types, I think maybe they are not used so widely in semi join, that's why I chose to be conservative in this PR. If we find more use cases for other types, we can revisit it.

feilong-liu requested review from a team, ClarenceThreepwood and jaystarshot as code owners October 7, 2025 23:23

prestodb-ci added the from:Meta PR from Meta label Oct 7, 2025

feilong-liu marked this pull request as draft October 7, 2025 23:23

sourcery-ai bot reviewed Oct 7, 2025

View reviewed changes

feilong-liu changed the title ~~Add optimization to randomize null key in semi join~~ feat: Add optimization to randomize null key in semi join Oct 7, 2025

feilong-liu requested review from kaikalur and rschlussel October 9, 2025 17:49

feilong-liu marked this pull request as ready for review October 9, 2025 17:49

feilong-liu force-pushed the semi_random branch from 0ff190b to fd93f82 Compare October 9, 2025 17:58

kaikalur reviewed Oct 9, 2025

View reviewed changes

feilong-liu force-pushed the semi_random branch from fd93f82 to 5d5a9ce Compare October 10, 2025 06:31

feilong-liu requested a review from kaikalur October 10, 2025 06:41

feilong-liu force-pushed the semi_random branch from 5d5a9ce to b132289 Compare October 13, 2025 03:20

feilong-liu requested a review from elharo as a code owner October 13, 2025 03:20

feilong-liu force-pushed the semi_random branch from b132289 to 3c99f10 Compare October 13, 2025 03:33

Add optimization to randomize null key in semi join

761beb4

feilong-liu force-pushed the semi_random branch from 3c99f10 to 761beb4 Compare October 13, 2025 04:08

kaikalur approved these changes Oct 13, 2025

View reviewed changes

rschlussel approved these changes Oct 13, 2025

View reviewed changes

feilong-liu merged commit d1c24bb into prestodb:master Oct 14, 2025
77 of 79 checks passed

feilong-liu deleted the semi_random branch October 14, 2025 05:01

feat: Add optimization to randomize null key in semi join #26251

feat: Add optimization to randomize null key in semi join #26251

Conversation

feilong-liu commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

Uh oh!

sourcery-ai bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Class diagram for new and updated join key randomization types

Class diagram for the new RandomizeNullKeyInSemiJoin optimizer

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

kaikalur Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

feilong-liu Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

kaikalur commented Oct 9, 2025

Uh oh!

feilong-liu commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaikalur commented Oct 10, 2025

Uh oh!

kaikalur commented Oct 10, 2025

Uh oh!

feilong-liu commented Oct 11, 2025

Uh oh!

kaikalur commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

feilong-liu commented Oct 13, 2025

Uh oh!

rschlussel Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

feilong-liu Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

feilong-liu commented Oct 7, 2025 •

edited

Loading

sourcery-ai bot commented Oct 7, 2025 •

edited

Loading

feilong-liu commented Oct 10, 2025 •

edited

Loading

kaikalur commented Oct 11, 2025 •

edited

Loading