AVX swizzle broadcast and swap optimization #1213
Merged
serge-sans-paille merged 5 commits into xtensor-stack:master from AntoinePrv:swizzle-avx on Nov 20, 2025
Changes from all commits (5 commits)
@@ -1629,88 +1629,98 @@ namespace xsimd
                 }
                 return split;
             }
-            // Duplicate lanes separately
-            // 1) duplicate low and high lanes
-            __m256 low_dup = _mm256_permute2f128_ps(self, self, 0x00); // [low | low]
-            __m256 hi_dup = _mm256_permute2f128_ps(self, self, 0x11); // [high| high]
+            constexpr auto lane_mask = mask % make_batch_constant<uint32_t, (mask.size / 2), A>();
+            XSIMD_IF_CONSTEXPR(detail::is_only_from_lo(mask))
+            {
+                __m256 broadcast = _mm256_permute2f128_ps(self, self, 0x00); // [low | low]
+                return _mm256_permutevar_ps(broadcast, lane_mask.as_batch());
+            }
+            XSIMD_IF_CONSTEXPR(detail::is_only_from_hi(mask))
+            {
+                __m256 broadcast = _mm256_permute2f128_ps(self, self, 0x11); // [high | high]
+                return _mm256_permutevar_ps(broadcast, lane_mask.as_batch());
+            }
+
+            // Fallback to general algorithm. This is the same as the dynamic version with the exception
+            // that possible operations are done at compile time.
+
+            // swap lanes
+            __m256 swapped = _mm256_permute2f128_ps(self, self, 0x01); // [high | low]
+
-            // 2) build lane-local index vector (each element = source_index & 3)
-            constexpr batch_constant<uint32_t, A, (V0 % 4), (V1 % 4), (V2 % 4), (V3 % 4), (V4 % 4), (V5 % 4), (V6 % 4), (V7 % 4)> half_mask;
+            // normalize mask taking modulo 4
+            constexpr auto half_mask = mask % make_batch_constant<uint32_t, 4, A>();
+
-            __m256 r0 = _mm256_permutevar_ps(low_dup, half_mask.as_batch()); // pick from low lane
-            __m256 r1 = _mm256_permutevar_ps(hi_dup, half_mask.as_batch()); // pick from high lane
+            // permute within each lane
+            __m256 r0 = _mm256_permutevar_ps(self, half_mask.as_batch());
+            __m256 r1 = _mm256_permutevar_ps(swapped, half_mask.as_batch());
+
-            constexpr batch_bool_constant<uint32_t, A, (V0 >= 4), (V1 >= 4), (V2 >= 4), (V3 >= 4), (V4 >= 4), (V5 >= 4), (V6 >= 4), (V7 >= 4)> lane_mask {};
+            // select lane by the mask index divided by 4
+            constexpr auto lane = batch_constant<uint32_t, A, 0, 0, 0, 0, 1, 1, 1, 1> {};
+            constexpr int lane_idx = ((mask / make_batch_constant<uint32_t, 4, A>()) != lane).mask();
+
-            return _mm256_blend_ps(r0, r1, lane_mask.mask());
+            return _mm256_blend_ps(r0, r1, lane_idx);
         }
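Note (editorial, not part of the PR): the scalar sketch below models what the new fallback path above computes, using plain arrays in place of the AVX intrinsics. The array contents and the sample mask are made up for illustration; the point is that once r1 is built from the lane-swapped input, the blend bit for output slot i has to be (V[i] / 4 != i / 4) rather than (V[i] >= 4).

    // Scalar model of the new float fallback: mimics _mm256_permute2f128_ps(self, self, 0x01),
    // _mm256_permutevar_ps and _mm256_blend_ps on plain arrays.
    #include <array>
    #include <cassert>
    #include <cstdint>

    int main()
    {
        std::array<float, 8> self = { 10, 11, 12, 13, 14, 15, 16, 17 };
        std::array<uint32_t, 8> V = { 4, 1, 6, 3, 0, 5, 2, 7 }; // made-up cross-lane mask

        // swap the two 128-bit lanes: swapped = [high | low]
        std::array<float, 8> swapped;
        for (int i = 0; i < 8; ++i)
            swapped[i] = self[(i + 4) % 8];

        std::array<float, 8> result;
        for (int i = 0; i < 8; ++i)
        {
            int lane_base = (i / 4) * 4;                     // base index of the destination lane
            uint32_t half = V[i] % 4;                        // half_mask = mask % 4
            float r0 = self[lane_base + half];               // permutevar on self: own lane
            float r1 = swapped[lane_base + half];            // permutevar on swapped: other lane
            bool other_lane = (V[i] / 4) != uint32_t(i / 4); // one bit of lane_idx
            result[i] = other_lane ? r1 : r0;                // _mm256_blend_ps
        }

        for (int i = 0; i < 8; ++i)
            assert(result[i] == self[V[i]]); // same as gathering self by the mask
        return 0;
    }

Slots whose source index lies in the other 128-bit lane take r1; the final result is exactly self[V[i]] for every slot.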
             template <class A, uint64_t V0, uint64_t V1, uint64_t V2, uint64_t V3>
             XSIMD_INLINE batch<double, A> swizzle(batch<double, A> const& self, batch_constant<uint64_t, A, V0, V1, V2, V3> mask, requires_arch<avx>) noexcept
             {
                 // cannot use detail::mod_shuffle as the mod and shift are different in this case
-                constexpr auto imm = ((V0 & 1) << 0) | ((V1 & 1) << 1) | ((V2 & 1) << 2) | ((V3 & 1) << 3);
-                XSIMD_IF_CONSTEXPR(detail::is_identity(mask)) { return self; }
+                constexpr auto imm = ((V0 % 2) << 0) | ((V1 % 2) << 1) | ((V2 % 2) << 2) | ((V3 % 2) << 3);
+                XSIMD_IF_CONSTEXPR(detail::is_identity(mask))
+                {
+                    return self;
+                }
                 XSIMD_IF_CONSTEXPR(!detail::is_cross_lane(mask))
                 {
                     return _mm256_permute_pd(self, imm);
                 }
-                // duplicate low and high part of input
-                __m256d lo = _mm256_permute2f128_pd(self, self, 0x00);
-                __m256d hi = _mm256_permute2f128_pd(self, self, 0x11);
+                XSIMD_IF_CONSTEXPR(detail::is_only_from_lo(mask))
+                {
+                    __m256d broadcast = _mm256_permute2f128_pd(self, self, 0x00); // [low | low]
+                    return _mm256_permute_pd(broadcast, imm);
+                }
+                XSIMD_IF_CONSTEXPR(detail::is_only_from_hi(mask))
+                {
+                    __m256d broadcast = _mm256_permute2f128_pd(self, self, 0x11); // [high | high]
+                    return _mm256_permute_pd(broadcast, imm);
+                }
+
+                // Fallback to general algorithm. This is the same as the dynamic version with the exception
+                // that possible operations are done at compile time.
+
+                // swap lanes
+                __m256d swapped = _mm256_permute2f128_pd(self, self, 0x01); // [high | low]
+
                 // permute within each lane
-                __m256d r0 = _mm256_permute_pd(lo, imm);
-                __m256d r1 = _mm256_permute_pd(hi, imm);
+                __m256d r0 = _mm256_permute_pd(self, imm);
+                __m256d r1 = _mm256_permute_pd(swapped, imm);
+
-                // mask to choose the right lane
-                constexpr batch_bool_constant<uint64_t, A, (V0 >= 2), (V1 >= 2), (V2 >= 2), (V3 >= 2)> blend_mask;
+                // select lane by the mask index divided by 2
+                constexpr auto lane = batch_constant<uint64_t, A, 0, 0, 1, 1> {};
+                constexpr int lane_idx = ((mask / make_batch_constant<uint64_t, 2, A>()) != lane).mask();

[Inline review comment (Contributor) on the line above: same here.]

-                // blend the two permutes
-                return _mm256_blend_pd(r0, r1, blend_mask.mask());
-            }
-            template <class A,
-                      typename T,
-                      uint32_t V0,
-                      uint32_t V1,
-                      uint32_t V2,
-                      uint32_t V3,
-                      uint32_t V4,
-                      uint32_t V5,
-                      uint32_t V6,
-                      uint32_t V7,
-                      detail::enable_sized_integral_t<T, 4> = 0>
-            XSIMD_INLINE batch<T, A> swizzle(batch<T, A> const& self,
-                                             batch_constant<uint32_t, A,
-                                                            V0,
-                                                            V1,
-                                                            V2,
-                                                            V3,
-                                                            V4,
-                                                            V5,
-                                                            V6,
-                                                            V7> const& mask,
-                                             requires_arch<avx>) noexcept
+                return _mm256_blend_pd(r0, r1, lane_idx);
+            }
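Note (editorial, not part of the PR): a hypothetical usage sketch showing which branch of the AVX double swizzle each kind of compile-time mask is expected to take. It assumes the translation unit is built with AVX enabled and an xsimd version where batch_constant is spelled batch_constant<T, A, Values...>, as in this diff; the function and mask names are invented.

    #include <xsimd/xsimd.hpp>

    using batch_d = xsimd::batch<double, xsimd::avx>;

    batch_d swizzle_examples(batch_d x)
    {
        // identity -> returned unchanged
        constexpr xsimd::batch_constant<uint64_t, xsimd::avx, 0, 1, 2, 3> id {};
        // stays within each 128-bit lane -> single _mm256_permute_pd
        constexpr xsimd::batch_constant<uint64_t, xsimd::avx, 1, 0, 3, 2> in_lane {};
        // reads only the low lane -> new broadcast path (one permute2f128 + one permute)
        constexpr xsimd::batch_constant<uint64_t, xsimd::avx, 1, 0, 1, 0> lo_only {};
        // mixes lanes -> general fallback (swap + two permutes + blend)
        constexpr xsimd::batch_constant<uint64_t, xsimd::avx, 3, 2, 1, 0> reverse {};

        return xsimd::swizzle(x, id) + xsimd::swizzle(x, in_lane)
            + xsimd::swizzle(x, lo_only) + xsimd::swizzle(x, reverse);
    }

Only the last mask pays for the full swap-and-blend sequence; the other three are resolved with at most two instructions thanks to the compile-time checks above.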
+
+            template <
+                class A, typename T,
+                uint32_t V0, uint32_t V1, uint32_t V2, uint32_t V3, uint32_t V4, uint32_t V5, uint32_t V6, uint32_t V7,
+                detail::enable_sized_integral_t<T, 4> = 0>
+            XSIMD_INLINE batch<T, A> swizzle(
+                batch<T, A> const& self,
+                batch_constant<uint32_t, A, V0, V1, V2, V3, V4, V5, V6, V7> const& mask,
+                requires_arch<avx>) noexcept
             {
-                return bitwise_cast<T>(
-                    swizzle(bitwise_cast<float>(self), mask));
+                return bitwise_cast<T>(swizzle(bitwise_cast<float>(self), mask));
             }

-            template <class A,
-                      typename T,
-                      uint64_t V0,
-                      uint64_t V1,
-                      uint64_t V2,
-                      uint64_t V3,
-                      detail::enable_sized_integral_t<T, 8> = 0>
-            XSIMD_INLINE batch<T, A>
-            swizzle(batch<T, A> const& self,
-                    batch_constant<uint64_t, A, V0, V1, V2, V3> const& mask,
-                    requires_arch<avx>) noexcept
+            template <class A, typename T, uint64_t V0, uint64_t V1, uint64_t V2, uint64_t V3, detail::enable_sized_integral_t<T, 8> = 0>
+            XSIMD_INLINE batch<T, A> swizzle(batch<T, A> const& self, batch_constant<uint64_t, A, V0, V1, V2, V3> const& mask, requires_arch<avx>) noexcept
             {
-                return bitwise_cast<T>(
-                    swizzle(bitwise_cast<double>(self), mask));
+                return bitwise_cast<T>(swizzle(bitwise_cast<double>(self), mask));
             }

             // transpose
             template <class A>
             XSIMD_INLINE void transpose(batch<float, A>* matrix_begin, batch<float, A>* matrix_end, requires_arch<avx>) noexcept
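Note (editorial, not part of the PR): the two wrappers above forward sized-integral swizzles to the float and double kernels through bitwise_cast. This is valid because a swizzle only moves whole 32-bit or 64-bit elements and never inspects their values, so a bit-preserving reinterpretation before and after the permute is exact. A hypothetical caller-side sketch (name and mask invented):

    #include <cstdint>
    #include <xsimd/xsimd.hpp>

    using batch_i32 = xsimd::batch<int32_t, xsimd::avx>;

    batch_i32 rotate_low_lane(batch_i32 v)
    {
        // rotate the low 128-bit lane by one element, keep the high lane unchanged
        constexpr xsimd::batch_constant<uint32_t, xsimd::avx, 1, 2, 3, 0, 4, 5, 6, 7> m {};
        // dispatches to the float kernel above via bitwise_cast, then casts back to int32
        return xsimd::swizzle(v, m);
    }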
Review discussion

I have difficulties seeing how the former lane_mask = V_i >= 4 is equivalent to V_i / 4 != lane[i]. Why isn't that just lane_mask >= make_batch_constant<uint32_t, 4, A>()?
Because r0 and r1 do not contain the same values as before: before, r0 contains items from low in both lanes and r1 contains items from high in both lanes; after, each r0 lane contains items from its own lane while each r1 lane contains items from the other lane. For instance, before, a 0 in the second lane must be selected from r0 (low values), while after it must be selected from r1 (the other lane).
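Note (editorial, not part of the PR): a scalar rendering of the example above, with source index V_i = 0 feeding output slot i = 4 (second lane). The old scheme blends with V_i >= 4 (bit clear, take r0), the new scheme blends with V_i / 4 != i / 4 (bit set, take r1), and both end up selecting the same element.

    #include <array>
    #include <cassert>
    #include <cstdint>

    int main()
    {
        std::array<float, 8> self = { 10, 11, 12, 13, 14, 15, 16, 17 };
        const int i = 4;        // destination slot, second lane
        const uint32_t Vi = 0;  // source index, low lane
        const uint32_t half = Vi % 4;

        // old scheme: r0 gathers from the duplicated low lane, r1 from the duplicated high lane
        float old_r0 = self[half];                   // low_dup
        float old_r1 = self[4 + half];               // hi_dup
        float old_res = (Vi >= 4) ? old_r1 : old_r0; // blend bit = 0 -> r0

        // new scheme: r0 gathers from the destination's own lane, r1 from the swapped lanes
        float new_r0 = self[(i / 4) * 4 + half];                        // permutevar on self
        float new_r1 = self[((i / 4) ^ 1) * 4 + half];                  // permutevar on swapped
        float new_res = (Vi / 4 != uint32_t(i / 4)) ? new_r1 : new_r0;  // blend bit = 1 -> r1

        assert(old_res == self[Vi] && new_res == self[Vi]); // both pick self[0] == 10
        return 0;
    }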
@serge-sans-paille is this OK for you?
and this saves a few permutes, perfect!