-
Notifications
You must be signed in to change notification settings - Fork 2.3k
[AMD] enhance range analysis and buffer-op only relies on range-analysis #8372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
c610cac
to
8cf91be
Compare
I set change the status to "ready for review" despite failures on Nvidia platform. My change is only about AMD platform, I guess my change is not the culprit. |
0289d6c
to
68eff64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to review the range analysis, but this looks good so far!
// step 2: Get element type and size. | ||
// e.g. addPtrOp.getType is tensor<64x64x!tt.ptr<f16>, then elemTy is | ||
// !tt.ptr<f16>, and dereferencing elemTy gets f16. | ||
// TODO: Not sure if we need to keep dereferencing in a loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My experience was that yes you do need to do this.
<< ((szLimit2GB > byteOfst) ? ", out or range" | ||
: ",in range")); | ||
|
||
return byteOfst <= szLimit2GB; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thank you!
solver->load<AMD::TritonIntegerRangeAnalysis>(assumptions); | ||
solver->load<AMD::TritonIntegerRangeAnalysis>( | ||
assumptions, &getAnalysis<DominanceInfo>(), | ||
/*assumeNoArithOverflow=*/true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a hack or what is the equivalent when a user writes Python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@njriasan thank you very much for code reviewing, and your initial implementation.
For this specific change, I put a comment to the RangeAnalysis.cpp
. The initial motivation can be explained by this contrived example:
a and b were used in an operation indicating it is signed int.
c = a + b
We know both a and b in [0, smax). if it will not overflow, the c will fall in [0, smax), otherwise c will be in `(smin, smax). This flag is to tell if arithmetic operation never overflow.
This "optimization" seems to be useful in proving > 0, however, pretty useless for proving < specified value. I decided to ditch this "optimization" in the first place, it might keep it for a while to see if such "optimzation" is useful in the future.
For now, this optimization is turned off. Just turned on for testing. Tuning it on for testing only see only small difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this part since both you and Lei are confused.
third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp
Outdated
Show resolved
Hide resolved
The confusing feature is to perform value-range analysis assuming arithmetic op has nsw and nuw flags (even they are not present) e.g. pid * block_size will still fit in [0, smax] despite that the pid itself is in [0, smax]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing these correctness issues! I'm good with the current impl; we can incrementally work to improve it for perf going forward.
This change fix bugs in range-analysis, and let buffer-ops use the range-analysis result to decide if it's legal to convert memory-op to buffer-ops.
It fixes https://github.com/ROCm/triton-internal/issues/1180 and ROCm#871.
The highlight are following:
Range Analysis
tl.assume
. Previously, it does not consider the control flow relationship between, saytl.assume x > 0
and the location of occurrence of x.make_range(begin, end)
, previous vr is [begin, end], now is [begin, end-1]. Small change in concept incur huge change the regression test.Buffer-ops