@AntoinePrv AntoinePrv commented Nov 18, 2025

Similar optimizations to #1201 but for AVX

constexpr batch_bool_constant<uint32_t, A, (V0 >= 4), (V1 >= 4), (V2 >= 4), (V3 >= 4), (V4 >= 4), (V5 >= 4), (V6 >= 4), (V7 >= 4)> lane_mask {};
// select lane by the mask index divided by 4
constexpr auto lane = batch_constant<uint32_t, A, 0, 0, 0, 0, 1, 1, 1, 1> {};
constexpr int lane_idx = ((mask / make_batch_constant<uint32_t, 4, A>()) != lane).mask();

Contributor

I have difficulty seeing how the former lane_mask = V_i >= 4 is equivalent to V_i / 4 != lane[i].

Why isn't that just lane_mask >= make_batch_constant<uint32_t, 4, A>()?

Contributor Author

Because r0 and r1 do not contain the same values as before:

  • before: r0 contains items from the low half of the input in both lanes, and r1 contains items from the high half in both lanes
  • after: each r0 lane contains items from its own lane, while each r1 lane contains items from the other lane.

For instance, a 0 in the second lane previously had to be selected from r0 (low values), whereas it must now be selected from r1 (other lane).
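
To make the difference between the two selection rules concrete, here is a scalar sketch of both strategies for the 8 x 32-bit case. It is only a model under assumptions: it does not use xsimd or AVX intrinsics, and the helper names (permute_in_lane, blend, swizzle_before, swizzle_after, and so on) are hypothetical. It assumes the kernel permutes within each 4-element lane and then blends r0 and r1 per slot, as described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec = std::array<int, 8>;            // 8 x 32-bit elements, two 4-element lanes
using Idx = std::array<std::uint32_t, 8>;  // swizzle indices, each in [0, 8)
using Sel = std::array<bool, 8>;           // per-slot blend selector: true picks r1

// Scalar model of an in-lane permute: slot i can only read from its own lane of src.
Vec permute_in_lane(const Vec& src, const Idx& idx) {
    Vec out{};
    for (std::size_t i = 0; i < 8; ++i)
        out[i] = src[(i / 4) * 4 + idx[i] % 4];
    return out;
}

// Scalar model of a blend: take r1[i] where sel[i] is true, r0[i] otherwise.
Vec blend(const Vec& r0, const Vec& r1, const Sel& sel) {
    Vec out{};
    for (std::size_t i = 0; i < 8; ++i)
        out[i] = sel[i] ? r1[i] : r0[i];
    return out;
}

// "Before": r0 comes from [low | low], r1 from [high | high],
// so r1 is picked exactly when the index points into the high half.
Sel select_before(const Idx& idx) {
    Sel sel{};
    for (std::size_t i = 0; i < 8; ++i)
        sel[i] = idx[i] >= 4;
    return sel;
}

Vec swizzle_before(const Vec& x, const Idx& idx) {
    const Vec low_low = { x[0], x[1], x[2], x[3], x[0], x[1], x[2], x[3] };
    const Vec high_high = { x[4], x[5], x[6], x[7], x[4], x[5], x[6], x[7] };
    return blend(permute_in_lane(low_low, idx), permute_in_lane(high_high, idx),
                 select_before(idx));
}

// "After": r0 comes from the input itself, r1 from the input with lanes swapped,
// so r1 is picked exactly when the index points into the other lane than slot i.
Sel select_after(const Idx& idx) {
    Sel sel{};
    for (std::size_t i = 0; i < 8; ++i)
        sel[i] = (idx[i] / 4) != (i / 4);
    return sel;
}

Vec swizzle_after(const Vec& x, const Idx& idx) {
    const Vec swapped = { x[4], x[5], x[6], x[7], x[0], x[1], x[2], x[3] };
    return blend(permute_in_lane(x, idx), permute_in_lane(swapped, idx),
                 select_after(idx));
}

// Both strategies should agree with the plain definition out[i] = x[idx[i]].
void check_example() {
    const Vec x = { 10, 11, 12, 13, 14, 15, 16, 17 };
    const Idx idx = { 0, 5, 2, 7, 0, 1, 6, 3 };
    Vec expected{};
    for (std::size_t i = 0; i < 8; ++i)
        expected[i] = x[idx[i]];
    assert(swizzle_before(x, idx) == expected);
    assert(swizzle_after(x, idx) == expected);
}
```

In this model, slot 4 with index 0 is the case from the example above: select_before picks r0 for it, while select_after picks r1, yet both strategies return the same element.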

Contributor Author

@serge-sans-paille is this OK for you?

Contributor

> Because r0 and r1 do not contain the same values as before:
>
> * before: `r0` contains items from the low half of the input in both lanes, and `r1` contains items from the high half in both lanes
> * after: each `r0` lane contains items from its own lane, while each `r1` lane contains items from the other lane.
>
> For instance, a 0 in the second lane previously had to be selected from r0 (low values), whereas it must now be selected from r1 (other lane).

And this saves a few permutes, perfect!

constexpr batch_bool_constant<uint64_t, A, (V0 >= 2), (V1 >= 2), (V2 >= 2), (V3 >= 2)> blend_mask;
// select lane by the mask index divided by 2
constexpr auto lane = batch_constant<uint64_t, A, 0, 0, 1, 1> {};
constexpr int lane_idx = ((mask / make_batch_constant<uint64_t, 2, A>()) != lane).mask();

Contributor

same here
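
For the 4 x 64-bit case, the same rule can be written as a small function that builds the integer blend immediate. This is a hedged sketch rather than the xsimd implementation: cross_lane_bits and high_half_bits are hypothetical helpers, and the sketch assumes that bit i of the immediate controls slot i, so that a set bit picks r1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of xsimd): bit i is set when index idx[i]
// points into the other lane than slot i. N slots are split into two lanes.
template <std::size_t N>
int cross_lane_bits(const std::array<std::uint64_t, N>& idx) {
    const std::size_t half = N / 2;
    int bits = 0;
    for (std::size_t i = 0; i < N; ++i)
        if (idx[i] / half != i / half)
            bits |= (1 << i);
    return bits;
}

// Hypothetical helper modeling the former rule: bit i is set when idx[i]
// points into the high half of the input, regardless of the slot's lane.
template <std::size_t N>
int high_half_bits(const std::array<std::uint64_t, N>& idx) {
    const std::size_t half = N / 2;
    int bits = 0;
    for (std::size_t i = 0; i < N; ++i)
        if (idx[i] >= half)
            bits |= (1 << i);
    return bits;
}

// The two rules agree on slots in the first lane and differ on the second lane.
void check_example64() {
    const std::array<std::uint64_t, 4> idx = { 0, 3, 0, 1 };
    assert(high_half_bits(idx) == 2);   // only slot 1 reads from the high half
    assert(cross_lane_bits(idx) == 14); // slots 1, 2 and 3 read from the other lane
}
```

With the identity swizzle { 0, 1, 2, 3 }, cross_lane_bits returns 0, meaning every slot is taken from r0 in this model.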

@serge-sans-paille serge-sans-paille merged commit 6ebf925 into xtensor-stack:master Nov 20, 2025
60 checks passed
@AntoinePrv AntoinePrv deleted the swizzle-avx branch November 21, 2025 10:00
2 participants