The matmul result is obviously wrong. I then ran the next cell with (512, 512), which, surprisingly, produced a correct result.
I suspect there is a bug in the indexing or a missing guard clause. Another example with (16, 16), which matches the batch size, also passed.
P.S. I do not have a GPU and am running this on a Mac, so everything runs in CPU mode with TRITON_INTERPRET=1.
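To make the guard-clause suspicion concrete, here is a minimal NumPy sketch of a tiled matmul. It is not the actual kernel. The tile size `BLOCK = 16`, the function `tiled_matmul`, and the modulo wraparound used to emulate an out-of-bounds pointer read are all assumptions made for illustration. It only shows how a missing bounds mask could explain why shapes that are multiples of the tile size pass while other shapes fail.

```python
import numpy as np

BLOCK = 16  # hypothetical tile size, chosen only for this sketch


def tiled_matmul(a, b, guarded):
    """Tile-by-tile matmul over flat row-major buffers.

    With guarded=False, out-of-range offsets are not masked, so a tile
    that hangs over the edge picks up unrelated elements (emulated here
    by wrapping the offset back into the buffer).
    """
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    a_flat, b_flat = a.ravel(), b.ravel()
    c = np.zeros((M, N), dtype=a.dtype)
    for m0 in range(0, M, BLOCK):
        for n0 in range(0, N, BLOCK):
            rm = m0 + np.arange(BLOCK)  # row indices of this tile
            rn = n0 + np.arange(BLOCK)  # column indices of this tile
            acc = np.zeros((BLOCK, BLOCK), dtype=a.dtype)
            for k0 in range(0, K, BLOCK):
                rk = k0 + np.arange(BLOCK)
                # Flat offsets, as a pointer-arithmetic kernel would compute them.
                a_idx = rm[:, None] * K + rk[None, :]
                b_idx = rk[:, None] * N + rn[None, :]
                # Emulated raw loads: wrap instead of faulting.
                a_tile = a_flat[a_idx % a_flat.size]
                b_tile = b_flat[b_idx % b_flat.size]
                if guarded:
                    # The guard clause: zero out everything past the edges.
                    a_mask = (rm[:, None] < M) & (rk[None, :] < K)
                    b_mask = (rk[:, None] < K) & (rn[None, :] < N)
                    a_tile = np.where(a_mask, a_tile, 0)
                    b_tile = np.where(b_mask, b_tile, 0)
                acc += a_tile @ b_tile
            # The store is always bounds-checked in this sketch.
            vm, vn = rm[rm < M], rn[rn < N]
            c[np.ix_(vm, vn)] = acc[: len(vm), : len(vn)]
    return c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (16, 20, 32):
        a = rng.integers(1, 10, size=(n, n))
        b = rng.integers(1, 10, size=(n, n))
        ref = a @ b
        print(
            n,
            "unguarded ok:", np.array_equal(tiled_matmul(a, b, False), ref),
            "guarded ok:", np.array_equal(tiled_matmul(a, b, True), ref),
        )
```

In this sketch the unguarded path only agrees with the reference when every dimension is a multiple of `BLOCK`, which would be consistent with (16, 16) and (512, 512) passing. Whether the real kernel fails for the same reason is still a guess.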