Skip to content

Conversation

yangshuxin
Copy link
Contributor

@yangshuxin yangshuxin commented Oct 5, 2025

This change fix bugs in range-analysis, and let buffer-ops use the range-analysis result to decide if it's legal to convert memory-op to buffer-ops.

It fixes https://github.com/ROCm/triton-internal/issues/1180 and ROCm#871.

The highlight are following:

  • Range Analysis

    • fix the way to use tl.assume. Previously, it does not consider the control flow relationship between, say tl.assume x > 0 and the location of occurrence of x.
    • correct the value range of make_range(begin, end), previous vr is [begin, end], now is [begin, end-1]. Small change in concept incur huge change the regression test.
  • Buffer-ops

  • for large tensor (>2G), remove the ad-hoc, and mistaken range-analysis in the pass. It only relies on the result of the range-analysis pass.
  • previous, buffer-ops pass only check element-index > 0. The right condition is byte-offset in [0, 2G-element-size].
  • Previous there is a similar work here [AMD] Update BufferOps Non-Negative Check to be in Bytes #7908, contributed by @njriasan . My change to this part is similar but fix some bugs in PR7908 (.e.g. lattice could be nullptr), and update large number of testings. That being said, now that @njriasan made the first change, credit for the part belong to him.

@yangshuxin
Copy link
Contributor Author

I set change the status to "ready for review" despite failures on Nvidia platform. My change is only about AMD platform, I guess my change is not the culprit.

@yangshuxin yangshuxin marked this pull request as ready for review October 7, 2025 18:12
@yangshuxin yangshuxin changed the title [AMD][DRAFT] revamp range analysis [AMD] revamp range analysis Oct 7, 2025
@yangshuxin yangshuxin force-pushed the shuxin/revamp_value_range branch from 0289d6c to 68eff64 Compare October 9, 2025 19:53
@yangshuxin yangshuxin changed the title [AMD] revamp range analysis [AMD] enhance range analysis and buffer-op only relies on range-analysis Oct 9, 2025
Copy link
Contributor

@njriasan njriasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to review the range analysis, but this looks good so far!

// step 2: Get element type and size.
// e.g. addPtrOp.getType is tensor<64x64x!tt.ptr<f16>, then elemTy is
// !tt.ptr<f16>, and dereferencing elemTy gets f16.
// TODO: Not sure if we need to keep dereferencing in a loop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experience was that yes you do need to do this.

<< ((szLimit2GB > byteOfst) ? ", out or range"
: ",in range"));

return byteOfst <= szLimit2GB;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thank you!

solver->load<AMD::TritonIntegerRangeAnalysis>(assumptions);
solver->load<AMD::TritonIntegerRangeAnalysis>(
assumptions, &getAnalysis<DominanceInfo>(),
/*assumeNoArithOverflow=*/true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a hack or what is the equivalent when a user writes Python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njriasan thank you very much for code reviewing, and your initial implementation.

For this specific change, I put a comment to the RangeAnalysis.cpp. The initial motivation can be explained by this contrived example:

 a and b were used in an operation indicating it is signed int. 
 c = a + b

We know both a and b in [0, smax). if it will not overflow, the c will fall in [0, smax), otherwise c will be in `(smin, smax). This flag is to tell if arithmetic operation never overflow.

This "optimization" seems to be useful in proving > 0, however, pretty useless for proving < specified value. I decided to ditch this "optimization" in the first place, it might keep it for a while to see if such "optimzation" is useful in the future.

For now, this optimization is turned off. Just turned on for testing. Tuning it on for testing only see only small difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this part since both you and Lei are confused.

The confusing feature is to perform value-range analysis assuming arithmetic op has nsw and nuw flags (even they are not present)
e.g. pid * block_size will still fit in [0, smax] despite that the pid itself is in [0, smax]
Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing these correctness issues! I'm good with the current impl; we can incrementally work to improve it for perf going forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants