[VPlan] Extract reverse operation for reverse accesses #146525
Conversation
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-backend-risc-v

Author: Mel Chen (Mel-Chen)

Changes: This patch introduces VPInstruction::Reverse and extracts the reverse operations of loaded/stored values from reverse memory accesses. This extraction facilitates future support for permutation elimination within VPlan.

Patch is 69.62 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/146525.diff

18 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 67a51c12b508e..d5aeb4feb19ba 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1541,6 +1541,12 @@ RISCVTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
cast<VectorType>(ICA.getArgTypes()[0]), {}, CostKind,
0, cast<VectorType>(ICA.getReturnType()));
}
+ case Intrinsic::experimental_vp_reverse: {
+ return getShuffleCost(TTI::SK_Reverse,
+ cast<VectorType>(ICA.getReturnType()),
+ cast<VectorType>(ICA.getArgTypes()[0]), {}, CostKind,
+ 0, cast<VectorType>(ICA.getReturnType()));
+ }
}
if (ST->hasVInstructions() && RetTy->isVectorTy()) {
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index b01c8b02ec66a..94782c33f5bda 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8880,6 +8880,10 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
// bring the VPlan to its final state.
// ---------------------------------------------------------------------------
+ // Adjust the result of reverse memory accesses.
+ VPlanTransforms::runPass(VPlanTransforms::adjustRecipesForReverseAccesses,
+ *Plan);
+
// Adjust the recipes for any inloop reductions.
adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 61b5ccd85bc6e..55175a889d0e0 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -970,6 +970,8 @@ class VPInstruction : public VPRecipeWithIRFlags,
// It produces the lane index across all unrolled iterations. Unrolling will
// add all copies of its original operand as additional operands.
FirstActiveLane,
+ // Returns a reversed vector for the operand.
+ Reverse,
// The opcodes below are used for VPInstructionWithType.
//
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index f3b99fe34c069..f87b6de42c8b8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -126,6 +126,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
return IntegerType::get(Ctx, 1);
case VPInstruction::Broadcast:
case VPInstruction::PtrAdd:
+ case VPInstruction::Reverse:
// Return the type based on first operand.
return inferScalarType(R->getOperand(0));
case VPInstruction::BranchOnCond:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 1a38932ef99fe..b4ed4ef3147c6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -444,6 +444,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
case VPInstruction::ExtractPenultimateElement:
case VPInstruction::FirstActiveLane:
case VPInstruction::Not:
+ case VPInstruction::Reverse:
return 1;
case Instruction::ICmp:
case Instruction::FCmp:
@@ -873,6 +874,9 @@ Value *VPInstruction::generate(VPTransformState &State) {
return Res;
}
+ case VPInstruction::Reverse: {
+ return Builder.CreateVectorReverse(State.get(getOperand(0)), "reverse");
+ }
default:
llvm_unreachable("Unsupported opcode for instruction");
}
@@ -948,6 +952,13 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
I32Ty, {Arg0Ty, I32Ty, I1Ty});
return Ctx.TTI.getIntrinsicInstrCost(Attrs, Ctx.CostKind);
}
+ case VPInstruction::Reverse: {
+ assert(VF.isVector() && "Reverse operation must be vector type");
+ Type *VectorTy = toVectorTy(Ctx.Types.inferScalarType(this), VF);
+ return Ctx.TTI.getShuffleCost(
+ TargetTransformInfo::SK_Reverse, cast<VectorType>(VectorTy),
+ cast<VectorType>(VectorTy), {}, Ctx.CostKind, 0);
+ }
case VPInstruction::ExtractPenultimateElement:
if (VF == ElementCount::getScalable(1))
return InstructionCost::getInvalid();
@@ -1033,6 +1044,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
case VPInstruction::WideIVStep:
case VPInstruction::StepVector:
case VPInstruction::ReductionStartVector:
+ case VPInstruction::Reverse:
return false;
default:
return true;
@@ -1179,6 +1191,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
case VPInstruction::ReductionStartVector:
O << "reduction-start-vector";
break;
+ case VPInstruction::Reverse:
+ O << "reverse";
+ break;
default:
O << Instruction::getOpcodeName(getOpcode());
}
@@ -2967,12 +2982,7 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF,
Cost += Ctx.TTI.getMemoryOpCost(Opcode, Ty, Alignment, AS, Ctx.CostKind,
OpInfo, &Ingredient);
}
- if (!Reverse)
- return Cost;
-
- return Cost += Ctx.TTI.getShuffleCost(
- TargetTransformInfo::SK_Reverse, cast<VectorType>(Ty),
- cast<VectorType>(Ty), {}, Ctx.CostKind, 0);
+ return Cost;
}
void VPWidenLoadRecipe::execute(VPTransformState &State) {
@@ -3004,8 +3014,6 @@ void VPWidenLoadRecipe::execute(VPTransformState &State) {
NewLI = Builder.CreateAlignedLoad(DataTy, Addr, Alignment, "wide.load");
}
applyMetadata(*cast<Instruction>(NewLI));
- if (Reverse)
- NewLI = Builder.CreateVectorReverse(NewLI, "reverse");
State.set(this, NewLI);
}
@@ -3061,8 +3069,6 @@ void VPWidenLoadEVLRecipe::execute(VPTransformState &State) {
0, Attribute::getWithAlignment(NewLI->getContext(), Alignment));
applyMetadata(*NewLI);
Instruction *Res = NewLI;
- if (isReverse())
- Res = createReverseEVL(Builder, Res, EVL, "vp.reverse");
State.set(this, Res);
}
@@ -3083,12 +3089,8 @@ InstructionCost VPWidenLoadEVLRecipe::computeCost(ElementCount VF,
getLoadStoreAddressSpace(const_cast<Instruction *>(&Ingredient));
InstructionCost Cost = Ctx.TTI.getMaskedMemoryOpCost(
Instruction::Load, Ty, Alignment, AS, Ctx.CostKind);
- if (!Reverse)
- return Cost;
- return Cost + Ctx.TTI.getShuffleCost(
- TargetTransformInfo::SK_Reverse, cast<VectorType>(Ty),
- cast<VectorType>(Ty), {}, Ctx.CostKind, 0);
+ return Cost;
}
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
@@ -3118,13 +3120,6 @@ void VPWidenStoreRecipe::execute(VPTransformState &State) {
}
Value *StoredVal = State.get(StoredVPValue);
- if (isReverse()) {
- // If we store to reverse consecutive memory locations, then we need
- // to reverse the order of elements in the stored value.
- StoredVal = Builder.CreateVectorReverse(StoredVal, "reverse");
- // We don't want to update the value in the map as it might be used in
- // another expression. So don't call resetVectorValue(StoredVal).
- }
Value *Addr = State.get(getAddr(), /*IsScalar*/ !CreateScatter);
Instruction *NewSI = nullptr;
if (CreateScatter)
@@ -3154,8 +3149,6 @@ void VPWidenStoreEVLRecipe::execute(VPTransformState &State) {
CallInst *NewSI = nullptr;
Value *StoredVal = State.get(StoredValue);
Value *EVL = State.get(getEVL(), VPLane(0));
- if (isReverse())
- StoredVal = createReverseEVL(Builder, StoredVal, EVL, "vp.reverse");
Value *Mask = nullptr;
if (VPValue *VPMask = getMask()) {
Mask = State.get(VPMask);
@@ -3196,12 +3189,8 @@ InstructionCost VPWidenStoreEVLRecipe::computeCost(ElementCount VF,
getLoadStoreAddressSpace(const_cast<Instruction *>(&Ingredient));
InstructionCost Cost = Ctx.TTI.getMaskedMemoryOpCost(
Instruction::Store, Ty, Alignment, AS, Ctx.CostKind);
- if (!Reverse)
- return Cost;
- return Cost + Ctx.TTI.getShuffleCost(
- TargetTransformInfo::SK_Reverse, cast<VectorType>(Ty),
- cast<VectorType>(Ty), {}, Ctx.CostKind, 0);
+ return Cost;
}
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 730deb0686b2a..cf41b6d00f285 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2172,6 +2172,14 @@ static VPRecipeBase *createEVLRecipe(VPValue *HeaderMask,
VPI->getDebugLoc());
}
+ if (VPI->getOpcode() == VPInstruction::Reverse) {
+ SmallVector<VPValue *> Ops(VPI->operands());
+ Ops.append({&AllOneMask, &EVL});
+ return new VPWidenIntrinsicRecipe(Intrinsic::experimental_vp_reverse,
+ Ops, TypeInfo.inferScalarType(VPI),
+ VPI->getDebugLoc());
+ }
+
VPValue *LHS, *RHS;
// Transform select with a header mask condition
// select(header_mask, LHS, RHS)
@@ -3347,3 +3355,34 @@ void VPlanTransforms::addBranchWeightToMiddleTerminator(VPlan &Plan,
MDB.createBranchWeights({1, VectorStep - 1}, /*IsExpected=*/false);
MiddleTerm->addMetadata(LLVMContext::MD_prof, BranchWeights);
}
+
+void VPlanTransforms::adjustRecipesForReverseAccesses(VPlan &Plan) {
+ if (Plan.hasScalarVFOnly())
+ return;
+
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
+ vp_depth_first_deep(Plan.getVectorLoopRegion()))) {
+ for (VPRecipeBase &R : *VPBB) {
+ auto *MemR = dyn_cast<VPWidenMemoryRecipe>(&R);
+ if (!MemR || !MemR->isReverse())
+ continue;
+
+ if (auto *L = dyn_cast<VPWidenLoadRecipe>(MemR)) {
+ auto *Reverse =
+ new VPInstruction(VPInstruction::Reverse, {L}, L->getDebugLoc());
+ Reverse->insertAfter(L);
+ L->replaceAllUsesWith(Reverse);
+ Reverse->setOperand(0, L);
+ continue;
+ }
+
+ if (auto *S = dyn_cast<VPWidenStoreRecipe>(MemR)) {
+ VPValue *StoredVal = S->getStoredValue();
+ auto *Reverse = new VPInstruction(VPInstruction::Reverse, {StoredVal},
+ S->getDebugLoc());
+ Reverse->insertBefore(S);
+ S->setOperand(1, Reverse);
+ }
+ }
+ }
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 40885cd52a127..abe592247e2de 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -239,6 +239,20 @@ struct VPlanTransforms {
/// Add branch weight metadata, if the \p Plan's middle block is terminated by
/// a BranchOnCond recipe.
static void addBranchWeightToMiddleTerminator(VPlan &Plan, ElementCount VF);
+
+ /// Add reverse recipes for reverse memory accesses.
+ /// For reverse loads, transform
+ /// WIDEN ir<%L> = load vp<%addr>
+ /// into
+ /// WIDEN ir<%L> = load vp<%addr>
+ /// EMIT vp<%RevL> = reverse ir<%L>
+ ///
+ /// For reverse stores, transform
+ /// WIDEN store vp<%addr>, ir<%SVal>
+ /// into
+ /// EMIT vp<%RevS> = reverse ir<%SVal>
+ /// WIDEN store vp<%addr>, vp<%RevS>
+ static void adjustRecipesForReverseAccesses(VPlan &Plan);
};
} // namespace llvm
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll
index 9485d827ced40..c838c63545341 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll
@@ -22,8 +22,8 @@ define void @vector_reverse_mask_nxv4i1(ptr %a, ptr %cond, i64 %N) #0 {
; CHECK: %[[WIDEMSKLOAD:.*]] = call <vscale x 4 x double> @llvm.masked.load.nxv4f64.p0(ptr %{{.*}}, i32 8, <vscale x 4 x i1> %[[REVERSE6]], <vscale x 4 x double> poison)
; CHECK: %[[REVERSE7:.*]] = call <vscale x 4 x double> @llvm.vector.reverse.nxv4f64(<vscale x 4 x double> %[[WIDEMSKLOAD]])
; CHECK: %[[FADD:.*]] = fadd <vscale x 4 x double> %[[REVERSE7]]
-; CHECK: %[[REVERSE9:.*]] = call <vscale x 4 x i1> @llvm.vector.reverse.nxv4i1(<vscale x 4 x i1> %{{.*}})
; CHECK: %[[REVERSE8:.*]] = call <vscale x 4 x double> @llvm.vector.reverse.nxv4f64(<vscale x 4 x double> %[[FADD]])
+; CHECK: %[[REVERSE9:.*]] = call <vscale x 4 x i1> @llvm.vector.reverse.nxv4i1(<vscale x 4 x i1> %{{.*}})
; CHECK: call void @llvm.masked.store.nxv4f64.p0(<vscale x 4 x double> %[[REVERSE8]], ptr %{{.*}}, i32 8, <vscale x 4 x i1> %[[REVERSE9]]
entry:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll b/llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll
index 1dd49ecf85b81..d6f619cce54a0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll
@@ -37,8 +37,8 @@ define void @vector_reverse_mask_v4i1(ptr noalias %a, ptr noalias %cond, i64 %N)
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i64 -24
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i64 -56
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x double>, ptr [[TMP3]], align 8
-; CHECK-NEXT: [[REVERSE:%.*]] = shufflevector <4 x double> [[WIDE_LOAD]], <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x double>, ptr [[TMP4]], align 8
+; CHECK-NEXT: [[REVERSE:%.*]] = shufflevector <4 x double> [[WIDE_LOAD]], <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
; CHECK-NEXT: [[REVERSE2:%.*]] = shufflevector <4 x double> [[WIDE_LOAD1]], <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
; CHECK-NEXT: [[TMP5:%.*]] = fcmp une <4 x double> [[REVERSE]], zeroinitializer
; CHECK-NEXT: [[TMP6:%.*]] = fcmp une <4 x double> [[REVERSE2]], zeroinitializer
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse-output.ll b/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse-output.ll
index 09b274de30214..6d55f7369f01e 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse-output.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse-output.ll
@@ -165,8 +165,8 @@ define void @vector_reverse_i32(ptr noalias %A, ptr noalias %B) {
; RV64-UF2-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, ptr [[TMP10]], i64 [[TMP15]]
; RV64-UF2-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, ptr [[TMP17]], i64 [[TMP16]]
; RV64-UF2-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i32>, ptr [[TMP14]], align 4
-; RV64-UF2-NEXT: [[REVERSE:%.*]] = call <vscale x 4 x i32> @llvm.vector.reverse.nxv4i32(<vscale x 4 x i32> [[WIDE_LOAD]])
; RV64-UF2-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x i32>, ptr [[TMP18]], align 4
+; RV64-UF2-NEXT: [[REVERSE:%.*]] = call <vscale x 4 x i32> @llvm.vector.reverse.nxv4i32(<vscale x 4 x i32> [[WIDE_LOAD]])
; RV64-UF2-NEXT: [[REVERSE2:%.*]] = call <vscale x 4 x i32> @llvm.vector.reverse.nxv4i32(<vscale x 4 x i32> [[WIDE_LOAD1]])
; RV64-UF2-NEXT: [[TMP19:%.*]] = add <vscale x 4 x i32> [[REVERSE]], splat (i32 1)
; RV64-UF2-NEXT: [[TMP20:%.*]] = add <vscale x 4 x i32> [[REVERSE2]], splat (i32 1)
@@ -180,8 +180,8 @@ define void @vector_reverse_i32(ptr noalias %A, ptr noalias %B) {
; RV64-UF2-NEXT: [[TMP28:%.*]] = getelementptr inbounds i32, ptr [[TMP21]], i64 [[TMP26]]
; RV64-UF2-NEXT: [[TMP29:%.*]] = getelementptr inbounds i32, ptr [[TMP28]], i64 [[TMP27]]
; RV64-UF2-NEXT: [[REVERSE3:%.*]] = call <vscale x 4 x i32> @llvm.vector.reverse.nxv4i32(<vscale x 4 x i32> [[TMP19]])
-; RV64-UF2-NEXT: store <vscale x 4 x i32> [[REVERSE3]], ptr [[TMP25]], align 4
; RV64-UF2-NEXT: [[REVERSE4:%.*]] = call <vscale x 4 x i32> @llvm.vector.reverse.nxv4i32(<vscale x 4 x i32> [[TMP20]])
+; RV64-UF2-NEXT: store <vscale x 4 x i32> [[REVERSE3]], ptr [[TMP25]], align 4
; RV64-UF2-NEXT: store <vscale x 4 x i32> [[REVERSE4]], ptr [[TMP29]], align 4
; RV64-UF2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; RV64-UF2-NEXT: [[TMP30:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
@@ -371,8 +371,8 @@ define void @vector_reverse_f32(ptr noalias %A, ptr noalias %B) {
; RV64-UF2-NEXT: [[TMP17:%.*]] = getelementptr inbounds float, ptr [[TMP10]], i64 [[TMP15]]
; RV64-UF2-NEXT: [[TMP18:%.*]] = getelementptr inbounds float, ptr [[TMP17]], i64 [[TMP16]]
; RV64-UF2-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP14]], align 4
-; RV64-UF2-NEXT: [[REVERSE:%.*]] = call <vscale x 4 x float> @llvm.vector.reverse.nxv4f32(<vscale x 4 x float> [[WIDE_LOAD]])
; RV64-UF2-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x float>, ptr [[TMP18]], align 4
+; RV64-UF2-NEXT: [[REVERSE:%.*]] = call <vscale x 4 x float> @llvm.vector.reverse.nxv4f32(<vscale x 4 x float> [[WIDE_LOAD]])
; RV64-UF2-NEXT: [[REVERSE2:%.*]] = call <vscale x 4 x float> @llvm.vector.reverse.nxv4f32(<vscale x 4 x float> [[WIDE_LOAD1]])
; RV64-UF2-NEXT: [[TMP19:%.*]] = fadd <vscale x 4 x float> [[REVERSE]], splat (float 1.000000e+00)
; RV64-UF2-NEXT: [[TMP20:%.*]] = fadd <vscale x 4 x float> [[REVERSE2]], splat (float 1.000000e+00)
@@ -386,8 +386,8 @@ define void @vector_reverse_f32(ptr noalias %A, ptr noalias %B) {
; RV64-UF2-NEXT: [[TMP28:%.*]] = getelementptr inbounds float, ptr [[TMP21]], i64 [[TMP26]]
; RV64-UF2-NEXT: [[TMP29:%.*]] = getelementptr inbounds float, ptr [[TMP28]], i64 [[TMP27]]
; RV64-UF2-NEXT: [[REVERSE3:%.*]] = call <vscale x 4 x float> @llvm.vector.reverse.nxv4f32(<vscale x 4 x float> [[TMP19]])
-; RV64-UF2-NEXT: store <vscale x 4 x float> [[REVERSE3]], ptr [[TMP25]], align 4
; RV64-UF2-NEXT: [[REVERSE4:%.*]] = call <vscale x 4 x float> @llvm.vector.reverse.nxv4f32(<vscale x 4 x float> [[TMP20]])
+; RV64-UF2-NEXT: store <vscale x 4 x float> [[REVERSE3]], ptr [[TMP25]], align 4
; RV64-UF2-NEXT: store <vscale x 4 x float> [[REVERSE4]], ptr [[TMP29]], align 4
; RV64-UF2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; RV64-UF2-NEXT: [[TMP30:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll b/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll
index dd8b7d6ea7e42..6d49a7fc16ad5 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll
@@ -105,10 +105,12 @@ define void @vector_reverse_i64(ptr nocapture noundef writeonly %A, ptr nocaptur
; CHECK-NEXT: CLONE ir<%arrayidx> = getelementptr inbounds ir<%B>, ir<%idxprom>
; CHECK-NEXT: vp<%9> = vector-end-pointer inbounds ir<%arrayidx>, vp<%0>
; CHECK-NEXT: WIDEN ir<%1> = load vp<%9>
-; CHECK-NEXT: WIDEN ir<%add9> = add ir<%1>, ir<1>
+; CHECK-NEXT: EMIT vp<%10> = reverse ir<%1>
+; CHECK-NEXT: WIDEN ir<%add9> = add vp<%10>, ir<1>
; CHECK-NEXT: CLONE ir<%arrayidx3> = getelementptr inbounds ir<%A>, ir<%idxprom>
-; CHECK-NEXT: vp<%10> = vector-end-pointer inbounds ir<%arrayidx3>, vp<%0>
-; CHECK-NEXT: WIDEN store vp<%10>, ir<%add9>
+; CHECK-NEXT: vp<%11> = vector-end-pointer inbounds ir<%arrayidx3>, vp<%0>
+; CHECK-NEXT: EMIT vp<%12> = reverse ir<%add9>
+; CHECK-NEXT: WIDEN store vp<%11>, vp<%12>
; CHECK-NEXT: EMIT vp<%index.next> = add nuw vp<%6>, vp<%1>
; CHECK-NEXT: EMIT branch-on-count vp<%index.next>, vp<%2>
; CHECK-NEXT: No successors
@@ -167,8 +169,10 @@ define void @vector_reverse_i64(ptr nocapture noundef writeonly %A, ptr nocaptur
; CHECK-NEXT: LV(REG): At #9 Interval # 3
; CHECK-NEXT: LV(REG): At #10 Interval # 3
; CHECK-NEXT: LV(REG): At #11 Interval # 3
-; CHECK-NEXT: LV(REG): At #12 Interval # 2
-; CHECK-NEXT: LV(REG): At #13 Interval # 2
+; CHECK-NEXT: LV(REG): At #12 Interval # 3
+; CHECK-NEXT: LV(REG): At #13 Interval # 3
+; CHECK-NEXT: LV(REG): At #14 Interval # 2
+; CHECK-NEXT: LV(REG): At #15 Interval # 2
; CHECK-NEXT: LV(REG): VF = vscale x 4
; CHECK-NEXT: LV(REG): Found max...
[truncated]
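To make the intent of the (truncated) diff concrete, here is a minimal, illustrative LLVM IR sketch of a reversed consecutive load as the vectorizer emits it today; the value names are hypothetical and the snippet is not taken from the patch. The generated IR stays the same after this change — what moves is the VPlan-level modeling, where the reverse becomes a standalone VPInstruction::Reverse instead of being folded into the memory recipe.

```llvm
; Illustrative sketch (hypothetical names): a reversed consecutive load of
; <4 x i32>. The wide load reads a contiguous block of memory, and a reverse
; shuffle then flips the element order. This patch models that shuffle as a
; separate VPInstruction::Reverse in VPlan rather than as part of the load
; recipe, so later transforms can eliminate redundant reverse pairs.
%wide.load = load <4 x i32>, ptr %end.ptr, align 4
%reverse = shufflevector <4 x i32> %wide.load, <4 x i32> poison,
                         <4 x i32> <i32 3, i32 2, i32 1, i32 0>
```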
lukel97
left a comment
+1 on splitting this out, I think this works well with the direction of splitting up big recipes into smaller ones. Just an idea about possibly inserting the reverses in tryToWiden but otherwise generally LGTM
This is a cool optimisation, I guess the LICM transform pulls the VPInstruction::Reverse out of the loop body so convertToEVLRecipes doesn't see it?
Yes. But I think it's fine that this isn't converted into vp.reverse here, since the operand of reverse that can be hoisted by LICM should be uniform. We could even remove the reverse operation entirely in this case in the future.
I just did a quick scan and it looks like the only places where reverse is set is in VPRecipeBuilder::tryToWidenMemory. Instead of introducing another transform, should we just insert the VPInstruction::Reverses there to avoid having to iterate over all the recipes again?
This way would mean we could also remove the Reverse field from VPWidenMemoryRecipe
I was initially worried this might affect interleaved accesses, but I was overthinking it. So far, generating the reverse directly in tryToWidenMemory seems fine.
9d6136b7a21a91fe5c479b9071c113b7802f062f
> This way would mean we could also remove the Reverse field from VPWidenMemoryRecipe

We can't remove the Reverse field from VPWidenMemoryRecipe. We still need the reverse mask if it's a reverse access. I also don't plan to separate the reverse mask, as I think it would not bring any benefit.
Can we also just reverse the mask at construction?
That's possible, but the reason I'm not doing it for now is that, in case some reverse operations can't be eliminated, we might want to convert reverse accesses into strided accesses with a stride of -1. Keeping the Reverse field could make it easier to identify the target recipes that need conversion. Also, reverse masks generally cannot benefit from permutation elimination, I think. So that's why I haven't done it this way.
What I guess I would eventually like to see is that our optimisations to remove the header masks just become plain old peepholes, written pattern match style like in similarRecipes.
E.g. for a regular load we would try and match:
(load ptr, header-mask) -> (vp.load ptr, all-true, evl)
And if we were to split out the reverses for both the data and mask into separate recipes, and add a VF operand like what we currently do for the end pointer, we would also have:
(reverse (load (end-ptr p, vf), (reverse header-mask, vf)), vf)
->
(reverse (vp.load (end-ptr p, evl), all-true, evl), evl)
OR
(vp.strided-load p, all-true, stride=-1, evl)
I think these patterns seem simple enough, and we could probably write them with the VPlanPatternMatch. Most importantly, these transformations don't change the semantics and are just optimisations. So if somehow we miss one of these transforms it will still be correct.
For permutation elimination we would just need to have:
(reverse (reverse x N) N) -> x
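To illustrate the peephole style described above, here is a rough C++ sketch of the permutation-elimination fold written against VPlanPatternMatch. Note that m_Reverse is an assumed matcher that does not exist at the time of this discussion, so treat this purely as a sketch of the shape such a fold could take, not as existing API.

```cpp
// Sketch only: m_Reverse is a hypothetical matcher for VPInstruction::Reverse.
// In the full form the element-count operands (VF/EVL) of the two reverses
// would also have to match; that check is omitted here for brevity.
static bool simplifyDoubleReverse(VPInstruction &VPI) {
  using namespace llvm::VPlanPatternMatch;
  VPValue *X;
  // (reverse (reverse x)) -> x
  if (match(&VPI, m_Reverse(m_Reverse(m_VPValue(X))))) {
    VPI.replaceAllUsesWith(X);
    VPI.eraseFromParent();
    return true;
  }
  return false;
}
```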
Reverse access can indeed be pattern-matched.
1392a872f0b69310340088d6323f1ae3735838c6
I've separated out the reverse mask, but this is not as easy as we thought. :(
In addition to affecting the cost model (since the cost of the reverse mask is currently not computed), EVL lowering also requires extra handling for the reverse mask.
Now that the load/store doesn't handle reversing, it should not need the flag to indicate it is reversing
There was a similar discussion earlier: #146525 (comment)
I think it would be good to continue the discussion in the same comment thread.
The code structure in the function seems inconsistent: converting FirstOrderRecurrenceSplice is handled inline, while Reverse is handled in a separate function. Can we merge the loops, as we now need to unconditionally iterate over all recipes in the loop region anyway?
Same, redirect to #146525 (comment).
✅ With the latest revision this PR passed the C/C++ code formatter.
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll
Is this now no longer vectorized?
Yes, also caused by separating the reverse mask from reverse access recipes.
The change has been moved to #155579, and I will investigate it.
Rebased and ping. There are currently two points that may require further discussion:
From #146525 (comment), it sounds like converting them all would be simpler and also valid/correct. If so, is it good to go with the simpler option to start with?
EVL would be available in the middle block, but not in blocks before the loop region. Do we have such cases? Can we verify? Any reverse hoisted out of the loop should have a single-scalar operand; if so, the reverse can be removed independently, I think.
```cpp
if (R->getNumUsers() == 0 || R->hasMoreThanOneUniqueUser())
  return nullptr;
return dyn_cast<VPRecipeBase>(*R->user_begin());
```
Can this use VPUser::getSingleUser?
f61eaad
I'm not sure whether SmallVector Users can contain duplicate users. If we want to preserve the original semantics of this code, we should guard it with hasOneUser() before calling getSingleUser().
```llvm
; CHECK-NEXT: [[TMP0:%.*]] = sub i64 3, [[SPEC_SELECT]]
; CHECK-NEXT: br label %[[VECTOR_PH:.*]]
; CHECK: [[VECTOR_PH]]:
; CHECK-NEXT: [[REVERSE:%.*]] = call <vscale x 2 x i64> @llvm.vector.reverse.nxv2i64(<vscale x 2 x i64> zeroinitializer)
```
Would be good to add a fold reverse(live-in/single-scalar) -> live-in/single-scalar, as follow-up
Sure. I reverted to the original approach: directly convert all reverses in the vector loop region to vp.reverse, and left a TODO documenting a more complex conversion method.
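For reference, a sketch of what the suggested reverse(live-in/single-scalar) fold might look like as a follow-up. It assumes the Reverse opcode added by this patch plus helpers along the lines of VPValue::isLiveIn() and vputils::isSingleScalar(); treat the exact helper names as assumptions rather than confirmed API.

```cpp
// Hypothetical follow-up fold: reversing a live-in or single-scalar (uniform)
// operand is a no-op, since every lane of its broadcast holds the same value.
static void foldReverseOfUniform(VPInstruction &VPI) {
  if (VPI.getOpcode() != VPInstruction::Reverse)
    return;
  VPValue *Op = VPI.getOperand(0);
  if (Op->isLiveIn() || vputils::isSingleScalar(Op)) {
    VPI.replaceAllUsesWith(Op);
    VPI.eraseFromParent();
  }
}
```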
```cpp
/// Returns true if the value has exactly one unique user, ignoring multiple
/// uses by the same user.
bool hasOneUser() const {
  if (getNumUsers() == 0)
    return false;
  if (hasOneUse())
    return true;
  return std::equal(std::next(user_begin()), user_end(), user_begin());
}
```
is this needed?
We need this so that we won’t change the original semantics of this code.
However, I think this is a separate issue, so we can discuss it in #170826.
```cpp
}

// TODO: Only convert reverse to vp.reverse if it uses the result of
// vp.load, or defines the stored value of vp.store.
```
Unconditionally replacing all the reverses with vp.reverses here means that optimizeMaskToEVL is no longer correct:

```cpp
if (match(&CurRecipe,
          m_MaskedLoad(m_VPValue(EndPtr), m_RemoveMask(HeaderMask, Mask))) &&
    match(EndPtr, m_VecEndPtr(m_VPValue(Addr), m_Specific(&Plan->getVF()))) &&
    cast<VPWidenLoadRecipe>(CurRecipe).isReverse())
  return new VPWidenLoadEVLRecipe(cast<VPWidenLoadRecipe>(CurRecipe),
                                  AdjustEndPtr(EndPtr), EVL, Mask);
```

Previously, if we had a masked load like
```
headerMask:                      11110000
VPWidenLoadRecipe, reverse=true: xxxxabcd
```
We would now generate
```
VPWidenLoadEVLRecipe, EVL=4, reverse=true: abcdxxxx
```
I.e. the elements aren't shifted. The test diff in this PR happens to be fine because we then unconditionally replace all reverses with vp.reverse.
I'd really like to avoid unconditionally replacing recipes as it makes the EVL transformation error prone and hard to follow, and this would undo what #155394 fixed. We shouldn't need to transform any recipes aside from those that use the IV step.
Can we do the reverse -> vp.reverse transform in optimizeMaskToEVLRecipes alongside the load/store transforms instead?
The issues you mentioned in #146525 (comment) can be fixed when we go to implement the permutation elimination by rewriting the transform in terms of slides instead of reverses.
> Unconditionally replacing all the reverses with vp.reverses here means that optimizeMaskToEVL is no longer correct:
I think that's kind-of already the case, right? Until all recipes are converted, the intermediate VPlan may be partially incorrect.
For both approaches, correctness boils down to whether the reverse is currently tied to the load/store. I would like to avoid having correctness depend on the exact position of the reverse. We could move the vp.reverse introduction after EVL load/store recipe creation, and then assert that the only operands/users are EVL load/store recipes.
This makes me wonder if it would be easier to avoid adding them up front, and only 'materialize' the reverse operations after EVL recipes have been introduced (then we can convert VPWidenLoadR, reverse=true -> vp.reverse(VPWidenLoadEVLR...) atomically).
And then just materialize reverse() for plain VPWidenLoad/StoreR separately, possibly for now just in convertToConcreteRecipes.
> I think that's kind-of already the case, right? Until all recipes are converted, the intermediate VPlan may be partially incorrect.
Only the FOR recipes need to be converted, everything else in optimizeMaskToEVL is just an optimisation to replace the header mask with VP intrinsics. We can skip optimizeMaskToEVL and the VPlan will still be correct, because the masked recipes still have the same semantics.
The fact that we're intertwining the transformation to a variably stepping IV with optimisations makes the EVL transform harder to reason about. This is what #166164 aims to fix by moving out optimizeMaskToEVL.
> I would like to avoid having correctness depend on the exact position of the reverse.
I'm with you here, but extracting the reverse from VPWidenLoadRecipe doesn't actually require us to "fix up" anything in the EVL plan for correctness, which is why I find this part of the code confusing.
It's just that the existing optimization becomes incorrect because the semantics of VPWidenLoadRecipe have changed, which we need to adjust in optimizeMaskToEVL. I'll see if I can share a branch that shows what I mean better.
> Unconditionally replacing all the reverses with vp.reverses here means that optimizeMaskToEVL is no longer correct:
>
> ```cpp
> if (match(&CurRecipe,
>           m_MaskedLoad(m_VPValue(EndPtr), m_RemoveMask(HeaderMask, Mask))) &&
>     match(EndPtr, m_VecEndPtr(m_VPValue(Addr), m_Specific(&Plan->getVF()))) &&
>     cast<VPWidenLoadRecipe>(CurRecipe).isReverse())
>   return new VPWidenLoadEVLRecipe(cast<VPWidenLoadRecipe>(CurRecipe),
>                                   AdjustEndPtr(EndPtr), EVL, Mask);
> ```
>
> Previously, if we had a masked load like
>
> ```
> headerMask:                      11110000
> VPWidenLoadRecipe, reverse=true: xxxxabcd
> ```
>
> we would now generate
>
> ```
> VPWidenLoadEVLRecipe, EVL=4, reverse=true: abcdxxxx
> ```
>
> I.e. the elements aren't shifted. The test diff in this PR happens to be fine because we then unconditionally replace all reverses with vp.reverse.

Yes, there will be a brief moment where the semantics are temporarily incorrect during EVL lowering.
> I'd really like to avoid unconditionally replacing recipes as it makes the EVL transformation error prone and hard to follow, and this would undo what #155394 fixed. We shouldn't need to transform any recipes aside from those that use the IV step.
>
> Can we do the reverse -> vp.reverse transform in optimizeMaskToEVLRecipes alongside the load/store transforms instead?
>
> The issues you mentioned in #146525 (comment) can be fixed when we go to implement the permutation elimination by rewriting the transform in terms of slides instead of reverses.
We could convert reverse accesses into Splice(VPWidenLoadEVLRecipe(VecEndPtr(ptr, evl)), poison, -evl) inside optimizeMaskToEVLRecipes, and rely on the regular reverse rather than vp.reverse. The only concern is that, if we introduce simplification rules that can eliminate the reverse, there will be a temporary performance regression because the reverse access might not be lowered into VPWidenLoadEVLRecipe/VPWidenStoreEVLRecipe. However, correctness should not be affected.
If a short-term performance loss is acceptable before we allow the third operand of splice to take a variable, then I think it’s reasonable to perform this transformation directly in optimizeMaskToEVLRecipes.
> > Unconditionally replacing all the reverses with vp.reverses here means that optimizeMaskToEVL is no longer correct:
>
> I think that's kind-of already the case, right? Until all recipes are converted, the intermediate VPlan may be partially incorrect.
No strong opinion. I can accept the temporary semantic mismatch, though of course it would be better if we could avoid it altogether.
> For both approaches, correctness boils down to whether the reverse is currently tied to the load/store. I would like to avoid having correctness depend on the exact position of the reverse. We could move the vp.reverse introduction after EVL load/store recipe creation, and then assert that the only operands/users are EVL load/store recipes.
>
> This makes me wonder if it would be easier to avoid adding them up front, and only 'materialize' the reverse operations after EVL recipes have been introduced (then we can convert VPWidenLoadR, reverse=true -> vp.reverse(VPWidenLoadEVLR...) atomically).
>
> And then just materialize reverse() for plain VPWidenLoad/StoreR separately, possibly for now just in convertToConcreteRecipes.
I hope we can extract the reverse operations earlier. This would not only make it easier to eliminate redundant reverses, but also give us the opportunity to convert remaining reverse accesses into strided accesses with a -1 stride. Mask reverses might indeed be more suitable to extract later in executePlan, since the cost model currently does not account for the cost of mask reverses. Doing it this way avoids affecting the middle-stage cost estimations.
Hmm, maybe I should reopen #123923 and make some modifications.
> We could convert reverse accesses into Splice(VPWidenLoadEVLRecipe(VecEndPtr(ptr, evl)), poison, -evl) inside optimizeMaskToEVLRecipes, and rely on the regular reverse rather than vp.reverse.
Yup, this is what I had in mind: 3250467
This approach is safer and easier to reason about since the semantics of the VPlan never change.
> The only concern is that, if we introduce simplification rules that can eliminate the reverse, there will be a temporary performance regression because the reverse access might not be lowered into VPWidenLoadEVLRecipe/VPWidenStoreEVLRecipe. However, correctness should not be affected.
I can work on generalising the transform to be in terms of splices and not reverses to avoid the regression when the reverses are eliminated. I don't think that should block this PR, I'm happy to iterate on this in tree.
Btw I've posted an RFC to relax the requirements on the splice intrinsic, I will try to push that through.
Force-pushed from 3cd910f to 7b22bdf.
Ping. Currently, there are three different approaches for converting reverse to vp.reverse. Can we decide which approach we want to move forward with for now?
I don't see any way forward other than doing it in optimizeMaskToEVL, i.e. in 3250467. Having the VPlan be in an incorrect state temporarily makes things much harder to reason about and is likely to lead to more subtle miscompiles down the line. After this PR I am confident we can rewrite the reverse -> vp.reverse pattern in terms of slides, which should address the issue when two reverses are folded away.
Let me summarize the three approaches so far:

1. First approach: Unconditionally convert all reverses in the vectorized loop to vp.reverse. This is simple, but may temporarily produce incorrect semantics during the EVL lowering phase.
2. Second approach: In optimizeMaskToEVL, convert the reverse of a reversed stored value to vp.reverse, but convert the reverses of reverse load results outside of optimizeMaskToEVL by visiting the def-use chain of the result. This may temporarily produce incorrect semantics for reverse loads, but not for reverse stores. Even if the reverse is later moved or simplified, the reverse load can still be converted to an EVL reverse load, so performance is not affected.
3. Third approach: Convert all reverses inside optimizeMaskToEVL. There are no temporarily incorrect semantics, but if the reverse is later moved or simplified, some reverse loads may not be convertible to EVL reverse loads. This could temporarily hurt tail-folding performance with EVL, and can later be improved by replacing vp.reverse with llvm.splice + llvm.reverse.

Essentially, the difference between the second and third approaches lies only in the treatment of reverse loads.
Yes, the current patch already uses the third approach. Since all three approaches have their pros and cons, what I want to know now is which approach we should proceed with.
lukel97
left a comment
LGTM, I only have some nits. Thanks for pushing this through
```cpp
VPValue *StoredVal;
if (match(&CurRecipe, m_MaskedStore(m_VPValue(EndPtr), m_VPValue(StoredVal),
```
Can this just match the reverse in the same line so you don't need the separate match on line 2886
```diff
  VPValue *StoredVal;
- if (match(&CurRecipe, m_MaskedStore(m_VPValue(EndPtr), m_VPValue(StoredVal),
+ if (match(&CurRecipe, m_MaskedStore(m_VPValue(EndPtr), m_Reverse(m_VPValue(ReversedVal)),
```
```cpp
    Intrinsic::experimental_vp_reverse,
    {ReversedVal, Plan->getTrue(), &EVL},
    TypeInfo.inferScalarType(ReversedVal), {}, {},
    cast<VPInstruction>(StoredVal)->getDebugLoc());
```
I think you can just use CurRecipe's debugloc, since the original reverse is created with the store's debugloc?
```diff
-    cast<VPInstruction>(StoredVal)->getDebugLoc());
+    CurRecipe.getDebugLoc());
```
I don't think we have tests with debug locations for the EVL transform; it would be good to add some.
fhahn
left a comment
LGTM, thanks
```cpp
VPWidenStoreEVLRecipe(VPWidenStoreRecipe &S, VPValue *Addr,
                      VPValue *StoredVal, VPValue &EVL, VPValue *Mask)
```
The only addition here is the stored value, right? Can we unify this into a single constructor, and clarify that VPWidenStoreRecipe is only used for the other fields?