AVX swizzle broadcast and swap optimization #1213
| constexpr batch_bool_constant<uint32_t, A, (V0 >= 4), (V1 >= 4), (V2 >= 4), (V3 >= 4), (V4 >= 4), (V5 >= 4), (V6 >= 4), (V7 >= 4)> lane_mask {}; | ||
| // select lane by the mask index divided by 4 | ||
| constexpr auto lane = batch_constant<uint32_t, A, 0, 0, 0, 0, 1, 1, 1, 1> {}; | ||
| constexpr int lane_idx = ((mask / make_batch_constant<uint32_t, 4, A>()) != lane).mask(); |
I have difficulty seeing how the former lane_mask = V_i >= 4 is equivalent to V_i / 4 != lane[i].
Why isn't that just lane_mask >= make_batch_constant<uint32_t, 4, A>()?
Because r0 and r1 do not contain the same values as before:
- before: r0 contains items from low in both lanes and r1 contains items from high in both lanes
- after: each r0 lane contains items from its own lane while each r1 lane contains items from the other lane.
For instance, a 0 in the second lane previously had to be selected from r0 (low values), whereas now it must be selected from r1 (other lane).
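To make the comparison concrete, here is a minimal scalar sketch of the two selection schemes for 8 x 32-bit elements. It is not the xsimd implementation: `Vec8`, `permute_in_lane`, `swizzle_before`, `swizzle_after` and `matches_reference` are hypothetical names, and the only hardware behaviour assumed is that an in-lane variable permute (such as `_mm256_permutevar_ps`) reads from the element's own 128-bit lane using the low two bits of each index.

```cpp
#include <cstdint>

// Scalar stand-in for an 8 x 32-bit AVX register (hypothetical helper type):
// elements 0-3 form the low 128-bit lane, elements 4-7 the high lane.
struct Vec8 { std::uint32_t v[8]; };

// Model of an in-lane variable permute: output element i reads from its OWN
// lane, using only the low two bits of idx.v[i].
constexpr Vec8 permute_in_lane(const Vec8& src, const Vec8& idx)
{
    Vec8 out{};
    for (int i = 0; i < 8; ++i)
        out.v[i] = src.v[(i / 4) * 4 + (idx.v[i] & 3)];
    return out;
}

// "Before": r0 holds low-half items in both lanes, r1 holds high-half items
// in both lanes, so the blend condition is simply V_i >= 4.
constexpr Vec8 swizzle_before(const Vec8& x, const Vec8& idx)
{
    Vec8 low{}, high{};
    for (int i = 0; i < 8; ++i)
    {
        low.v[i] = x.v[i % 4];      // low half broadcast to both lanes
        high.v[i] = x.v[4 + i % 4]; // high half broadcast to both lanes
    }
    const Vec8 r0 = permute_in_lane(low, idx);
    const Vec8 r1 = permute_in_lane(high, idx);
    Vec8 out{};
    for (int i = 0; i < 8; ++i)
        out.v[i] = (idx.v[i] >= 4) ? r1.v[i] : r0.v[i];
    return out;
}

// "After": r0 holds items from each element's own lane, r1 holds items from
// the other lane, so the blend condition becomes V_i / 4 != lane[i].
constexpr Vec8 swizzle_after(const Vec8& x, const Vec8& idx)
{
    Vec8 swapped{};
    for (int i = 0; i < 8; ++i)
        swapped.v[i] = x.v[(i + 4) % 8]; // exchange the two 128-bit lanes
    const Vec8 r0 = permute_in_lane(x, idx);
    const Vec8 r1 = permute_in_lane(swapped, idx);
    Vec8 out{};
    for (int i = 0; i < 8; ++i)
    {
        const std::uint32_t own_lane = static_cast<std::uint32_t>(i / 4);
        out.v[i] = (idx.v[i] / 4 != own_lane) ? r1.v[i] : r0.v[i];
    }
    return out;
}

// Both schemes should reproduce the plain gather x[idx[i]].
constexpr bool matches_reference(const Vec8& x, const Vec8& idx)
{
    const Vec8 a = swizzle_before(x, idx);
    const Vec8 b = swizzle_after(x, idx);
    for (int i = 0; i < 8; ++i)
        if (a.v[i] != x.v[idx.v[i]] || b.v[i] != x.v[idx.v[i]])
            return false;
    return true;
}

constexpr Vec8 sample_x = { { 10, 11, 12, 13, 14, 15, 16, 17 } };
// Index 0 sits at position 4 (second lane), the case discussed above.
constexpr Vec8 sample_idx = { { 7, 6, 5, 4, 0, 1, 2, 3 } };
static_assert(matches_reference(sample_x, sample_idx),
              "both schemes must agree with x[idx[i]]");
```

In this model the element at position 4 with index 0 comes from r0 in `swizzle_before` and from r1 in `swizzle_after`, which is why the blend condition cannot stay a plain `V_i >= 4` comparison.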
@serge-sans-paille is this OK for you?
> Because `r0` and `r1` do not contain the same values as before:
> * before: `r0` contains items from low in both lanes and `r1` contains items from high in both lanes
> * after: each `r0` lane contains items from its lane while each `r1` lane contains items from the other lane.
>
> For instance, before a `0` in the second lane must be selected from `r0` (low values) while after it must be selected from `r1` (other lane).
And this saves a few permutes, perfect!
| constexpr batch_bool_constant<uint64_t, A, (V0 >= 2), (V1 >= 2), (V2 >= 2), (V3 >= 2)> blend_mask; | ||
| // select lane by the mask index divided by 2 | ||
| constexpr auto lane = batch_constant<uint64_t, A, 0, 0, 1, 1> {}; | ||
| constexpr int lane_idx = ((mask / make_batch_constant<uint64_t, 2, A>()) != lane).mask(); |
same here
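For the 4 x 64-bit case the same reasoning applies with a lane size of 2. The following sketch computes, in plain C++, the kind of bitmask that an expression like `((mask / 2) != lane).mask()` is expected to yield, assuming that `.mask()` sets bit i when element i of the comparison is true. `cross_lane_bits` and the sample index arrays are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Bit i of the result is set when index v[i] points into the other 128-bit
// lane, i.e. when v[i] / 2 differs from the lane that position i lives in.
constexpr int cross_lane_bits(const std::uint64_t (&v)[4])
{
    int bits = 0;
    for (int i = 0; i < 4; ++i)
    {
        const std::uint64_t own_lane = static_cast<std::uint64_t>(i / 2); // 0, 0, 1, 1
        if (v[i] / 2 != own_lane)
            bits |= 1 << i;
    }
    return bits;
}

constexpr std::uint64_t identity_idx[4] = { 0, 1, 2, 3 }; // nothing crosses lanes
constexpr std::uint64_t swap_idx[4] = { 2, 3, 0, 1 };     // every element crosses
constexpr std::uint64_t bcast_idx[4] = { 0, 0, 0, 0 };    // only the high lane crosses

static_assert(cross_lane_bits(identity_idx) == 0x0, "identity stays in its own lane");
static_assert(cross_lane_bits(swap_idx) == 0xF, "lane swap takes everything from the other lane");
static_assert(cross_lane_bits(bcast_idx) == 0xC, "broadcast of element 0 crosses only in the high lane");
```

Note that a `V_i >= 2` test would give 0 for the broadcast pattern, which is correct only when r0 and r1 hold the low and high halves rather than the own and other lanes.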
Similar optimizations to #1201 but for AVX