
Conversation


@chengjunlu commented Aug 11, 2025

Use the transposing 2D block IO to load a column-major matrix from global memory. (The column-major case generalizes to any case where the register layout's fast-changing dimension differs from the fast-changing dimension in global memory.)

The transpose is a recursive operation:
[image: illustration of the recursive transpose]

To use the transposing 2D block IO to load a column-major matrix on Xe+:

  1. Load the matrix from memory as a d32-typed matrix, transposed in registers.
  2. Bitcast the 1xNxd32 result to (32/m)xNxdm, where m is the element size in bits.
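The two steps above can be emulated with a short numpy sketch. Assumptions: fp16 elements (m = 16, so 32/m = 2) and a hypothetical 4x8 tile; the array names are illustrative, not taken from the implementation.

```python
import numpy as np

# Hypothetical tile: a 4x8 fp16 matrix A stored column-major in "global memory".
M, N = 4, 8
A = np.arange(M * N, dtype=np.float16).reshape(M, N)
col_major_bytes = A.T.copy().tobytes()  # columns of A are contiguous in memory

# Step 1: load as a d32-typed matrix, transposed in registers.
# Each d32 packs 32/16 = 2 consecutive fp16 values of one column.
d32 = np.frombuffer(col_major_bytes, dtype=np.uint32).reshape(N, M // 2)
d32_t = d32.T  # the 2D block load performs this transpose in hardware

# Step 2: bitcast each row of d32 values into 32/m = 2 rows of fp16.
elems = d32_t.copy().view(np.float16)  # shape (M//2, 2*N)
rows = elems.reshape(M // 2, N, 2).transpose(0, 2, 1).reshape(M, N)

assert np.array_equal(rows, A)  # the full matrix is recovered row-major
```

The same bitcast generalizes to smaller element types, e.g. 8-bit elements give 32/m = 4 rows per d32 row.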

The code currently implements the functionality only for a limited set of layouts, and it is not yet optimized for efficiency.

Copilot AI left a comment

Pull Request Overview

This draft PR implements transpose 2D block load functionality to efficiently load column-major matrices from global memory on Intel Xe+ GPUs. The implementation introduces a transpose operation when the register layout's fast-changing dimension differs from the memory layout's, using d32-typed matrices with bitcast operations for the transformation.

  • Added support for transpose 2D block IO operations with transpose parameter
  • Enhanced block IO tile size calculation to handle transpose scenarios
  • Implemented new test coverage for transpose and column major load operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

  • LoadStoreOpToLLVM.cpp — Major refactoring of the 2D block load implementation to support transpose operations and simplified layout handling
  • tensor-pointer-load-block-2d.mlir — Updated test expectations for new block load configurations and tile sizes
  • test_block_store.py — Added transpose parameter and column-major test cases for block operations

@chengjunlu force-pushed the chengjun/trans_2d_load branch from efff84d to 55c896e on August 11, 2025 07:42
@etiotto marked this pull request as draft on October 9, 2025 14:09
@chengjunlu force-pushed the chengjun/trans_2d_load branch from 20a1637 to 942ca37 on November 4, 2025 04:49
@chengjunlu changed the title from "[Draft] Transpose 2d load." to "[LoadStoreOpToLLVM] Transpose 2d load." on Nov 4, 2025
@chengjunlu marked this pull request as ready for review on November 4, 2025 04:50
@chengjunlu force-pushed the chengjun/trans_2d_load branch 7 times, most recently from 210886e to e979428 on November 10, 2025 05:37
packedElemSizeInBits = 32;
numPackedVals = packedElemSizeInBits / elemSizeInBits;

// Improve this. The current 2D block load only transposes the matrix at
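As a hypothetical illustration of the quoted packing math (the names mirror the C++ variables; this is not the actual implementation), each 32-bit packed element holds 32/elemSizeInBits source values:

```python
def num_packed_vals(elem_size_in_bits):
    # The transposing 2D block load always loads 32-bit packed elements,
    # so each d32 carries 32 / elemSizeInBits source values.
    packed_elem_size_in_bits = 32
    return packed_elem_size_in_bits // elem_size_in_bits

print(num_packed_vals(16))  # fp16: 2 values per d32
print(num_packed_vals(8))   # fp8:  4 values per d32
```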
Contributor Author


The improvements will be added in another PR, to minimize the changes in this PR.

@chengjunlu requested a review from Copilot on November 10, 2025 05:41
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



@chengjunlu

@whitneywhtsang @etiotto, the transpose loading is ready for review.

@chengjunlu force-pushed the chengjun/trans_2d_load branch from e979428 to 248ae4c on November 12, 2025 03:00
Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>
@whitneywhtsang

Can you fix the typo in the image of the PR description or remove it?

return axisInfo ? axisInfo->getStride(dim) : -1;
if (axisInfo) {
const SmallVector<int64_t> &stride = axisInfo->getStride();
if (dim < stride.size()) {
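The intent of the guarded lookup above can be sketched in Python (a hypothetical mirror of the C++, with the axis info modeled as a plain dict): return the stride for `dim` only when axis info exists and `dim` is in range, and -1 otherwise.

```python
def get_stride(axis_info, dim):
    # Mirror of the guarded C++ path: missing axis info, or an
    # out-of-range dimension, both fall back to the unknown stride -1.
    if axis_info is None:
        return -1
    stride = axis_info["stride"]  # assumed per-dimension stride list
    return stride[dim] if dim < len(stride) else -1

print(get_stride({"stride": [8, 1]}, 1))  # in range: 1
print(get_stride({"stride": [8, 1]}, 5))  # out of range: -1
print(get_stride(None, 0))                # no axis info: -1
```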

Why would we call getStride with a dim greater than the size of stride?



Development

Successfully merging this pull request may close these issues.

[06-fused-attention] Determine if FP8 operand B can use 2d block load

3 participants