-
Notifications
You must be signed in to change notification settings - Fork 2.3k
[AMD] Support TDM store on gfx1250 #8392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
third_party/amd/include/Dialect/TritonAMDGPU/IR/TritonAMDGPUOps.td
Outdated
Show resolved
Hide resolved
auto [group0, group1] = LLVM::AMD::createTDMDescriptor( | ||
rewriter, loc, getTypeConverter(), elementType, blockShape, tensorShape, | ||
tensorStride, offset, srcPtr, dstPtr, op.getPred(), numWraps, | ||
padInterval, padAmount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we do that when we create the descriptor instead of doing it on the fly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to know the shared memory destination and offset from the load/store op to create a complete tdm descriptor.
It is possible to create a partially filled tdm descriptor when lowering the create tensor descriptor op and then update it later for load/store, but this turns out still require some effort (specifically we need to extract and update global memory pointer, shared memory pointer and tensor shape
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tensor shape and smem layout should be known at descriptor creation time. The problem is that the offsets need to be combined with the base global address?
I think separating the invariant part of the descriptor will be important to make sure we don't create code sequence accidentally loop dependent.
Also this solution will implicitly rely on LICM and CSE cleaning things up, I think this is a bit risky as in some cases we need to disable LICM to avoid register pressure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it does depend on LICM to clean up. I can switch to do it in the make tensor descriptor if so. @antiagainst Do you have other thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there is no downsides I think it would be better. It would also reduce liveranges right? as I assume the descriptor is more packed than the set of values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah what Thomas said makes sense to me. And expect it to provide better control and generate better code too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gluon changes LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now. @ThomasRaoux can you take another look?
Type globalPtrTy = ptr_ty(ctx, 1); | ||
|
||
Value globalAddrLow = group0[2]; | ||
Value globalAddrHigh = b.and_(group0[3], b.i32_val(0x7FFFFFFF)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As chatted internally, this is okay to get started for now. But we may want to define proper struct to avoid such i64 pack/unpack if LLVM is confused about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
return {group0, group1}; | ||
} | ||
|
||
void fillTDMDescriptor(RewriterBase &rewriter, Location loc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a blocking comment but at some point I would like to understand why we need to some much arithmetic for every load/store and how expensive it is.
This PR adds support for TDM store on gfx1250, following #8333. Exposes
tdm.async_load
through Gluon. Groups common TDM utilities for load/store.