[LV] Convert gather loads with invariant stride into strided loads #147297


Status: Draft · wants to merge 3 commits into main
Conversation

Mel-Chen
Contributor

@Mel-Chen Mel-Chen commented Jul 7, 2025

This patch detects stride memory accesses such as:

  void stride(int* a, int *b, int n) {
    for (int i = 0; i < n; i++)
      a[i * 5] = b[i * 5] + i;
  }

and converts widened non-consecutive loads (i.e., gathers) into strided loads when it is legal and profitable to do so.
The transformation is implemented as part of VPlan and is inspired by existing logic in RISCVGatherScatterLowering. Some of the legality and analysis logic has been moved into VPlan to enable this conversion earlier during vectorization planning.

This enables more efficient code generation for targets like RISC-V that support strided loads natively.

@Mel-Chen
Contributor Author

Mel-Chen commented Jul 7, 2025

This is a port of the approach used in RISCVGatherScatterLowering, implemented entirely in VPlan. If you'd prefer to switch to LLVM IR–based analysis using SCEV, please let me know.

That said, the current patch faces an awkward situation: all existing strided access patterns in our lit tests are currently not converted to strided accesses, because the cost returned by TTI.getStridedMemoryOpCost is higher than that of a gather.

As a result, I'm marking this patch as Draft for now, until we have test cases that can demonstrate the intended functionality.


github-actions bot commented Jul 7, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@lukel97
Contributor

lukel97 commented Jul 8, 2025

+1 on this approach of doing this as a VPlan transformation. That way we don't need to work with the legacy cost model.

I presume we will need to handle VPWidenStridedLoadRecipes in planContainsAdditionalSimplifications and skip the assertion?

That said, the current patch faces an awkward situation: all existing strided access patterns in our lit tests are currently not converted to strided accesses, because the cost returned by TTI.getStridedMemoryOpCost is higher than that of a gather.

Yeah I think the current RISCVTTIImpl::getStridedMemoryOpCost is too expensive. It should definitely be at least cheaper than RISCVTTIImpl::getGatherScatterOpCost, and at the moment the cost model seems to return the same for them.

I know the spacemit-x60 doesn't have fast vlse, but I don't think they're as slow as vluxei. Do we have benchmark results for these? cc @mikhailramalho

@wangpc-pp
Contributor

I know the spacemit-x60 doesn't have fast vlse, but I don't think they're as slow as vluxei. Do we have benchmark results for these? cc @mikhailramalho

I have some data on C908 for strided load (which is weird): camel-cdr/rvv-bench#12. The result may hold for spacemit-x60.

@Mel-Chen
Contributor Author

+1 on this approach of doing this as a VPlan transformation. That way we don't need to work with the legacy cost model.

Another approach is using SCEV to get the base and stride. Both can be transformed within VPlanTransform without relying on the legacy cost model.

I presume we will need to handle VPWidenStridedLoadRecipes in planContainsAdditionalSimplifications and skip the assertion?

I’ve previously forced the cost model to always return profitable in order to trigger the transformation, and didn’t encounter any assertions. I plan to use the same method to test against the llvm-test-suite and see if any issues come up.

@Mel-Chen
Contributor Author

Mel-Chen commented Jul 22, 2025

@lukel97 @wangpc-pp
#149955
Elvis is working on improving the gather/scatter cost model. This should help enable strided accesses.


  auto *LoadR = cast<VPWidenLoadRecipe>(MemR);
  auto *StridedLoad = new VPWidenStridedLoadRecipe(
      *cast<LoadInst>(&Ingredient), NewPtr, Stride, &Plan.getVF(),

I'm not very familiar with the details of RISC-V or the experimental_vp_strided_load intrinsic, so my apologies if my understanding is incorrect.
My understanding is that the Stride is already in bytes. However, it seems to be multiplied by the element size again in VPWidenStridedLoadRecipe::execute to convert it to bytes. Is this intentional?
