-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat: Add optimization to randomize null key in semi join #26251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideIntroduces a new optimizer to randomize null join keys in semi joins, driven by a configurable strategy, and integrates it into the planning pipeline with accompanying configuration and tests. Class diagram for new and updated join key randomization typesclassDiagram
class FeaturesConfig {
-boolean nativeEnforceJoinBuildInputPartition
-boolean randomizeOuterJoinNullKey
-RandomizeOuterJoinNullKeyStrategy randomizeOuterJoinNullKeyStrategy
-RandomizeSemiJoinNullKeyStrategy randomizeSemiJoinNullKeyStrategy
-ShardedJoinStrategy shardedJoinStrategy
-int joinShardCount
-boolean isOptimizeConditionalAggregationEnabled
+RandomizeSemiJoinNullKeyStrategy getRandomizeSemiJoinNullKeyStrategy()
+FeaturesConfig setRandomizeSemiJoinNullKeyStrategy(RandomizeSemiJoinNullKeyStrategy)
}
class RandomizeOuterJoinNullKeyStrategy {
<<enum>>
DISABLED
ALWAYS
}
class RandomizeSemiJoinNullKeyStrategy {
<<enum>>
DISABLED
ALWAYS
}
FeaturesConfig --> RandomizeOuterJoinNullKeyStrategy
FeaturesConfig --> RandomizeSemiJoinNullKeyStrategy
class SystemSessionProperties {
+RandomizeSemiJoinNullKeyStrategy getRandomizeSemiJoinNullKeyStrategy(Session)
}
SystemSessionProperties --> RandomizeSemiJoinNullKeyStrategy
Class diagram for the new RandomizeNullKeyInSemiJoin optimizerclassDiagram
class RandomizeNullKeyInSemiJoin {
-FunctionAndTypeManager functionAndTypeManager
-boolean isEnabledForTesting
+RandomizeNullKeyInSemiJoin(FunctionAndTypeManager)
+void setEnabledForTesting(boolean)
+boolean isEnabled(Session)
+PlanOptimizerResult optimize(PlanNode, Session, TypeProvider, VariableAllocator, PlanNodeIdAllocator, WarningCollector)
}
class SemiJoinUsageCollector {
+void collect(PlanNode)
+Set<SemiJoinNode> getEligibleSemiJoins()
}
class Rewriter {
-Session session
-FunctionAndTypeManager functionAndTypeManager
-PlanNodeIdAllocator planNodeIdAllocator
-VariableAllocator planVariableAllocator
-boolean planChanged
+boolean isPlanChanged()
+PlanNode visitSemiJoin(SemiJoinNode, RewriteContext<Set<SemiJoinNode>>)
}
RandomizeNullKeyInSemiJoin --> SemiJoinUsageCollector
RandomizeNullKeyInSemiJoin --> Rewriter
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/optimizations/TestRandomizeNullKeyInSemiJoin.java:42-51` </location>
<code_context>
+ false);
+ }
+
+ @Test
+ public void testSemiJoinWithUnsupportedJoinKeyType()
+ {
+ // Semi join with varchar keys - should NOT be optimized (only INTEGER, BIGINT, DATE are supported)
</code_context>
<issue_to_address>
**suggestion (testing):** Missing test for DATE join key type in semi join randomization.
Please add a test case for semi join randomization with a DATE join key to verify correct optimizer behavior.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
0ff190b
to
fd93f82
Compare
} | ||
|
||
private static class Rewriter | ||
extends SimplePlanRewriter<Set<SemiJoinNode>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we move some of the randomization logic to planner utils so it's common across the two optimizers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the function to generate random join key into a separate util files.
For other parts of the rewriter, the code in the outer join optimizer is more complex, as it contains an additional option which tracks if the input key is from an outer join or not, and cost based trigger with HBO. Also outer joins can have multiple join keys and we do randomization for all join keys. While for semi join, we only have one join key. These will make extracting the rest of logic a bit complex. Consider this code is relatively small and we see it will only be used in these two cases, I would prefer to leave the rest of code as it is now.
actually we can always add randomize(x) in (select randomize(x) from t) to preserve null semantics instead of looking for context? |
When x is NULL, x in (select x from t) will be NULL, while randomize(x) in (select randomize(x) from t) AND if(x is null, null, true) will be false AND NULL which will still be false. And we may do some special handling for NULL keys on the left side, but we will still face problem for cases where the right side have NULLs |
fd93f82
to
5d5a9ce
Compare
Actually, do not randomize the right side! Remember we do distinct for right side. So if the above rewrite and not touching right side will preserve the semijoin/anti-semijoin semantics fully and no context dep |
Yeah - see my comment - don't salt/randomize right side! |
I see, so for null skew only on left side, we can do (randomize(x) in (select x from t)) OR (if(x is null, null, false))
I am not sure how much it will help for null skew on the right side, as we will still hit this skew problem for the distinct aggregation? Since for the semi join, one row with NULL key in the right side will be the same as multiple rows with NULL keys, if we can have an runtime optimization that, if we already send out one row NULL key, we can directly discard the rest rows with NULL keys, this will be very useful. |
I say just don't salt right side. It's extremely rare and also we should distinct that will remove skew. and no special context checks required at all with that. |
5d5a9ce
to
b132289
Compare
b132289
to
3c99f10
Compare
Updated the code to do salt only on the left side. |
3c99f10
to
761beb4
Compare
this.functionAndTypeManager = requireNonNull(functionAndTypeManager, "functionAndTypeManager is null"); | ||
} | ||
|
||
private static boolean isSupportedType(VariableReferenceExpression variable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why just these types? Seems like anything other than like char/varchar/varbinary where you might actually be generating real keys would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For float/double, I am not sure how cast to varchar works for it, so I didn't include them. For other types, I think maybe they are not used so widely in semi join, that's why I chose to be conservative in this PR. If we find more use cases for other types, we can revisit it.
Description
This PR is to add an optimization to deal with NULL skew on the source side/left side in SemiJoin.
The motivating query is of pattern like:
Here
l.partkey
orl.orderkey
has a significant portion of values to be NULL. For this case, we cannot simply filtering rows which have NULL in semi join key, as for rows withl.partkey
to be NULL,l.orderkey
may not be NULL and return true for the semi join. In order to mitigate the NULL skew, we apply a similar optimization as in https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/RandomizeNullKeyInOuterJoin.java where we coalesce the NULL keys to a new key on the left/source side.Notice that the above optimization will map the NULL output corresponding to NULL key to false, hence we apply one more projection at the end
semi_output := new_semi_output OR (source_key IS NULL ? NULL : FALSE)
. Here when source_key is NULL, we will have new_semi_output to be false, and semi_output back to NULL. While when source_key is not NULL,new_semi_output OR FALSE
is stillnew_semi_output
which does not change the result.Motivation and Context
As described above.
Impact
For the motivating production queries, we observe a latency drop from 3 hours to only 4 minutes.
Test Plan
Unit tests, also run verifier suites with production queries suite1
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.