Skip to content

Conversation

borontion
Copy link
Contributor

@borontion borontion commented Oct 7, 2025

This PR adds support for TDM store on gfx1250, following #8333. Exposes tdm.async_load through Gluon. Groups common TDM utilities for load/store.

@borontion borontion marked this pull request as ready for review October 7, 2025 21:36
Comment on lines 1051 to 1054
auto [group0, group1] = LLVM::AMD::createTDMDescriptor(
rewriter, loc, getTypeConverter(), elementType, blockShape, tensorShape,
tensorStride, offset, srcPtr, dstPtr, op.getPred(), numWraps,
padInterval, padAmount);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we do that when we create the descriptor instead of doing it on the fly?

Copy link
Contributor Author

@borontion borontion Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to know the shared memory destination and offset from the load/store op to create a complete tdm descriptor.

It is possible to create a partially filled tdm descriptor when lowering the create tensor descriptor op and then update it later for load/store, but this turns out still require some effort (specifically we need to extract and update global memory pointer, shared memory pointer and tensor shape

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tensor shape and smem layout should be known at descriptor creation time. The problem is that the offsets need to be combined with the base global address?
I think separating the invariant part of the descriptor will be important to make sure we don't create code sequence accidentally loop dependent.
Also this solution will implicitly rely on LICM and CSE cleaning things up, I think this is a bit risky as in some cases we need to disable LICM to avoid register pressure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it does depend on LICM to clean up. I can switch to do it in the make tensor descriptor if so. @antiagainst Do you have other thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there is no downsides I think it would be better. It would also reduce liveranges right? as I assume the descriptor is more packed than the set of values?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah what Thomas said makes sense to me. And expect it to provide better control and generate better code too.

Copy link
Contributor

@peterbell10 peterbell10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gluon changes LGTM

Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. @ThomasRaoux can you take another look?

Type globalPtrTy = ptr_ty(ctx, 1);

Value globalAddrLow = group0[2];
Value globalAddrHigh = b.and_(group0[3], b.i32_val(0x7FFFFFFF));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As chatted internally, this is okay to get started for now. But we may want to define proper struct to avoid such i64 pack/unpack if LLVM is confused about it.

Copy link
Collaborator

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return {group0, group1};
}

void fillTDMDescriptor(RewriterBase &rewriter, Location loc,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocking comment but at some point I would like to understand why we need to some much arithmetic for every load/store and how expensive it is.

@antiagainst antiagainst merged commit b5fea1e into triton-lang:main Oct 9, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants