Skip to content

Conversation

feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Oct 7, 2025

Description

This PR is to add an optimization to deal with NULL skew on the source side/left side in SemiJoin.
The motivating query is of pattern like:

SELECT
    *
FROM lineitem l
WHERE
    l.partkey IN (
        SELECT
            partkey
        FROM partsupp
    )
    OR l.orderkey IN (
        SELECT
            orderkey
        FROM orders
    );

Here l.partkey or l.orderkey has a significant portion of values to be NULL. For this case, we cannot simply filtering rows which have NULL in semi join key, as for rows with l.partkey to be NULL, l.orderkey may not be NULL and return true for the semi join. In order to mitigate the NULL skew, we apply a similar optimization as in https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/RandomizeNullKeyInOuterJoin.java where we coalesce the NULL keys to a new key on the left/source side.

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we apply one more projection at the end semi_output := new_semi_output OR (source_key IS NULL ? NULL : FALSE). Here when source_key is NULL, we will have new_semi_output to be false, and semi_output back to NULL. While when source_key is not NULL, new_semi_output OR FALSE is still new_semi_output which does not change the result.

Motivation and Context

As described above.

Impact

For the motivating production queries, we observe a latency drop from 3 hours to only 4 minutes.

Test Plan

Unit tests, also run verifier suites with production queries suite1

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a new optimizer which do null skew mitigation for applicable semi joins

@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Oct 7, 2025
@feilong-liu feilong-liu marked this pull request as draft October 7, 2025 23:23
Copy link
Contributor

sourcery-ai bot commented Oct 7, 2025

Reviewer's Guide

Introduces a new optimizer to randomize null join keys in semi joins, driven by a configurable strategy, and integrates it into the planning pipeline with accompanying configuration and tests.

Class diagram for new and updated join key randomization types

classDiagram
    class FeaturesConfig {
        -boolean nativeEnforceJoinBuildInputPartition
        -boolean randomizeOuterJoinNullKey
        -RandomizeOuterJoinNullKeyStrategy randomizeOuterJoinNullKeyStrategy
        -RandomizeSemiJoinNullKeyStrategy randomizeSemiJoinNullKeyStrategy
        -ShardedJoinStrategy shardedJoinStrategy
        -int joinShardCount
        -boolean isOptimizeConditionalAggregationEnabled
        +RandomizeSemiJoinNullKeyStrategy getRandomizeSemiJoinNullKeyStrategy()
        +FeaturesConfig setRandomizeSemiJoinNullKeyStrategy(RandomizeSemiJoinNullKeyStrategy)
    }
    class RandomizeOuterJoinNullKeyStrategy {
        <<enum>>
        DISABLED
        ALWAYS
    }
    class RandomizeSemiJoinNullKeyStrategy {
        <<enum>>
        DISABLED
        ALWAYS
    }
    FeaturesConfig --> RandomizeOuterJoinNullKeyStrategy
    FeaturesConfig --> RandomizeSemiJoinNullKeyStrategy
    class SystemSessionProperties {
        +RandomizeSemiJoinNullKeyStrategy getRandomizeSemiJoinNullKeyStrategy(Session)
    }
    SystemSessionProperties --> RandomizeSemiJoinNullKeyStrategy
Loading

Class diagram for the new RandomizeNullKeyInSemiJoin optimizer

classDiagram
    class RandomizeNullKeyInSemiJoin {
        -FunctionAndTypeManager functionAndTypeManager
        -boolean isEnabledForTesting
        +RandomizeNullKeyInSemiJoin(FunctionAndTypeManager)
        +void setEnabledForTesting(boolean)
        +boolean isEnabled(Session)
        +PlanOptimizerResult optimize(PlanNode, Session, TypeProvider, VariableAllocator, PlanNodeIdAllocator, WarningCollector)
    }
    class SemiJoinUsageCollector {
        +void collect(PlanNode)
        +Set<SemiJoinNode> getEligibleSemiJoins()
    }
    class Rewriter {
        -Session session
        -FunctionAndTypeManager functionAndTypeManager
        -PlanNodeIdAllocator planNodeIdAllocator
        -VariableAllocator planVariableAllocator
        -boolean planChanged
        +boolean isPlanChanged()
        +PlanNode visitSemiJoin(SemiJoinNode, RewriteContext<Set<SemiJoinNode>>)
    }
    RandomizeNullKeyInSemiJoin --> SemiJoinUsageCollector
    RandomizeNullKeyInSemiJoin --> Rewriter
Loading

File-Level Changes

Change Details Files
Define and expose new semi join null-key randomization strategy in feature configuration
  • Add RandomizeSemiJoinNullKeyStrategy enum (DISABLED, ALWAYS)
  • Introduce randomizeSemiJoinNullKeyStrategy field with default DISABLED
  • Provide getter, @config setter, and description
FeaturesConfig.java
Add session property for semi join null-key randomization
  • Define RANDOMIZE_SEMI_JOIN_NULL_KEY_STRATEGY constant
  • Register PropertyMetadata with allowed enum values
  • Implement getRandomizeSemiJoinNullKeyStrategy()
SystemSessionProperties.java
Hook up the semi join randomization optimizer in the planner
  • Insert RandomizeNullKeyInSemiJoin into the optimizer sequence
  • Preserve existing outer join randomization placement
PlanOptimizers.java
Implement RandomizeNullKeyInSemiJoin optimization
  • Traverse plan to identify eligible semi joins based on output usage
  • Inject randomized key expressions via Project nodes
  • Reuse random keys for multiple joins on the same variable
  • Support enabling via session strategy or testing flag
RandomizeNullKeyInSemiJoin.java
Add and update tests for semi join null-key randomization
  • Create comprehensive TestRandomizeNullKeyInSemiJoin covering eligibility and edge cases
  • Extend TestFeaturesConfig to include mapping and defaults for new strategy
TestRandomizeNullKeyInSemiJoin.java
TestFeaturesConfig.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/optimizations/TestRandomizeNullKeyInSemiJoin.java:42-51` </location>
<code_context>
+                false);
+    }
+
+    @Test
+    public void testSemiJoinWithUnsupportedJoinKeyType()
+    {
+        // Semi join with varchar keys - should NOT be optimized (only INTEGER, BIGINT, DATE are supported)
</code_context>

<issue_to_address>
**suggestion (testing):** Missing test for DATE join key type in semi join randomization.

Please add a test case for semi join randomization with a DATE join key to verify correct optimizer behavior.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@feilong-liu feilong-liu changed the title Add optimization to randomize null key in semi join feat: Add optimization to randomize null key in semi join Oct 7, 2025
@feilong-liu feilong-liu marked this pull request as ready for review October 9, 2025 17:49
}

private static class Rewriter
extends SimplePlanRewriter<Set<SemiJoinNode>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we move some of the randomization logic to planner utils so it's common across the two optimizers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the function to generate random join key into a separate util files.
For other parts of the rewriter, the code in the outer join optimizer is more complex, as it contains an additional option which tracks if the input key is from an outer join or not, and cost based trigger with HBO. Also outer joins can have multiple join keys and we do randomization for all join keys. While for semi join, we only have one join key. These will make extracting the rest of logic a bit complex. Consider this code is relatively small and we see it will only be used in these two cases, I would prefer to leave the rest of code as it is now.

@kaikalur
Copy link
Contributor

kaikalur commented Oct 9, 2025

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add

randomize(x) in (select randomize(x) from t)
AND if(x is null, null, true)

to preserve null semantics instead of looking for context?

@feilong-liu
Copy link
Contributor Author

feilong-liu commented Oct 10, 2025

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add

randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)

to preserve null semantics instead of looking for context?

When x is NULL, x in (select x from t) will be NULL, while randomize(x) in (select randomize(x) from t) AND if(x is null, null, true) will be false AND NULL which will still be false.

And we may do some special handling for NULL keys on the left side, but we will still face problem for cases where the right side have NULLs

@kaikalur
Copy link
Contributor

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add

randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)

to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

@kaikalur
Copy link
Contributor

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

When x is NULL, x in (select x from t) will be NULL, while randomize(x) in (select randomize(x) from t) AND if(x is null, null, true) will be false AND NULL which will still be false.

And we may do some special handling for NULL keys on the left side, but we will still face problem for cases where the right side have NULLs

Yeah - see my comment - don't salt/randomize right side!

@feilong-liu
Copy link
Contributor Author

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false))
And for right side, we can randomize only when the semi join result usage follows the constraint mentioned in the description. Actually under this constraint, we can directly filter out the null keys instead, which is easier and more efficient.

Remember we do distinct for right side

I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation?

Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful.

@kaikalur
Copy link
Contributor

kaikalur commented Oct 11, 2025

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false)) And for right side, we can randomize only when the semi join result usage follows the constraint mentioned in the description. Actually under this constraint, we can directly filter out the null keys instead, which is easier and more efficient.

Remember we do distinct for right side

I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation?

Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful.

I say just don't salt right side. It's extremely rare and also we should distinct that will remove skew. and no special context checks required at all with that.

@feilong-liu
Copy link
Contributor Author

Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we cannot apply to anti semi joins, i.e. for NOT IN, and even if it's semi join, the output cannot be projected to output, and cannot be used in expressions which may return different results for NULL and false input (here for simplicity we disable the optimization if the output is used in any functions). This is guaranteed by the SemiJoinCollector visitor in the optimizer.

actually we can always add
randomize(x) in (select randomize(x) from t) AND if(x is null, null, true)
to preserve null semantics instead of looking for context?

Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep

I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false)) And for right side, we can randomize only when the semi join result usage follows the constraint mentioned in the description. Actually under this constraint, we can directly filter out the null keys instead, which is easier and more efficient.

Remember we do distinct for right side

I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation?
Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful.

I say just don't salt right side. It's extremely rare and also we should distinct that will remove skew. and no special context checks required at all with that.

Updated the code to do salt only on the left side.

this.functionAndTypeManager = requireNonNull(functionAndTypeManager, "functionAndTypeManager is null");
}

private static boolean isSupportedType(VariableReferenceExpression variable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why just these types? Seems like anything other than like char/varchar/varbinary where you might actually be generating real keys would be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For float/double, I am not sure how cast to varchar works for it, so I didn't include them. For other types, I think maybe they are not used so widely in semi join, that's why I chose to be conservative in this PR. If we find more use cases for other types, we can revisit it.

@feilong-liu feilong-liu merged commit d1c24bb into prestodb:master Oct 14, 2025
77 of 79 checks passed
@feilong-liu feilong-liu deleted the semi_random branch October 14, 2025 05:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants