GEMM results mismatch #467

@stepi9

Description

Hi,
I am working on a deep learning project for which I need to train a model on an AMD GPU cluster. I use JAX, Flax, and Optax as my main machine learning stack. When I run the Python code on my personal computer, which has an NVIDIA GPU, I get no errors. However, when I run the same code on an AMD GPU on the cluster, XLA's GEMM fusion autotuner logs errors saying that the results do not match the reference.

I compiled the rocm-jax library from source at the latest tag (v0.6.0) and installed the package in a Python venv. I also tried the official rocm/jax-community Docker image (via Apptainer), but I get the same messages.

The issue is not specific to my code: the same messages appear when I run one of Flax's example scripts (specifically https://github.com/google/flax/blob/main/examples/nnx_toy_examples/05_vae.py). When I run it, I get:

E0611 14:26:23.597251  451832 buffer_comparator.cc:145] Difference at 9344: 205.843, expected 183.143
E0611 14:26:23.597299  451832 buffer_comparator.cc:145] Difference at 15892: 210.321, expected 187.659
E0611 14:26:23.597305  451832 buffer_comparator.cc:145] Difference at 15938: 199.099, expected 178.016
2025-06-11 14:26:23.597314: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.669495  451832 buffer_comparator.cc:145] Difference at 7792: 185.703, expected 208.131
E0611 14:26:23.669535  451832 buffer_comparator.cc:145] Difference at 12102: 206.548, expected 183.614
E0611 14:26:23.669549  451832 buffer_comparator.cc:145] Difference at 15948: 218.747, expected 194.395
2025-06-11 14:26:23.669557: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.682719  451832 buffer_comparator.cc:145] Difference at 1025: 10.3105, expected 8.15896
E0611 14:26:23.682732  451832 buffer_comparator.cc:145] Difference at 1029: 9.24706, expected 7.75471
E0611 14:26:23.682736  451832 buffer_comparator.cc:145] Difference at 1031: 10.8553, expected 9.51487
E0611 14:26:23.682739  451832 buffer_comparator.cc:145] Difference at 1032: 8.61683, expected 7.628
E0611 14:26:23.682743  451832 buffer_comparator.cc:145] Difference at 1035: 10.0826, expected 8.13794
E0611 14:26:23.682747  451832 buffer_comparator.cc:145] Difference at 1036: 9.36097, expected 8.08674
E0611 14:26:23.682750  451832 buffer_comparator.cc:145] Difference at 1038: 7.40636, expected 5.49972
E0611 14:26:23.682755  451832 buffer_comparator.cc:145] Difference at 1041: 9.11373, expected 8.06417
E0611 14:26:23.682758  451832 buffer_comparator.cc:145] Difference at 1044: 9.96564, expected 6.78432
E0611 14:26:23.682761  451832 buffer_comparator.cc:145] Difference at 1047: 10.3575, expected 8.61353
2025-06-11 14:26:23.682768: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.739910  451832 buffer_comparator.cc:145] Difference at 3137: 40.0469, expected 34.1624
E0611 14:26:23.739965  451832 buffer_comparator.cc:145] Difference at 3138: 34.1857, expected 29.3643
E0611 14:26:23.739970  451832 buffer_comparator.cc:145] Difference at 3139: 33.6209, expected 29.6085
E0611 14:26:23.739975  451832 buffer_comparator.cc:145] Difference at 3140: 33.4683, expected 29.7014
E0611 14:26:23.739982  451832 buffer_comparator.cc:145] Difference at 3143: 36.3527, expected 31.76
E0611 14:26:23.739986  451832 buffer_comparator.cc:145] Difference at 3144: 36.3201, expected 31.9039
E0611 14:26:23.740009  451832 buffer_comparator.cc:145] Difference at 3145: 37.0308, expected 32.787
E0611 14:26:23.740015  451832 buffer_comparator.cc:145] Difference at 3146: 33.6818, expected 29.4154
E0611 14:26:23.740021  451832 buffer_comparator.cc:145] Difference at 3150: 41.9207, expected 37.5318
E0611 14:26:23.740027  451832 buffer_comparator.cc:145] Difference at 3151: 33.3773, expected 28.8064
2025-06-11 14:26:23.740039: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.748676  451832 buffer_comparator.cc:145] Difference at 256: 33.6314, expected 29.6305
E0611 14:26:23.748690  451832 buffer_comparator.cc:145] Difference at 257: 35.9533, expected 31.5235
E0611 14:26:23.748695  451832 buffer_comparator.cc:145] Difference at 258: 34.0687, expected 29.0571
E0611 14:26:23.748702  451832 buffer_comparator.cc:145] Difference at 260: 34.1512, expected 30.1305
E0611 14:26:23.748709  451832 buffer_comparator.cc:145] Difference at 261: 35.5173, expected 31.2248
E0611 14:26:23.748714  451832 buffer_comparator.cc:145] Difference at 263: 37.6513, expected 33.6381
E0611 14:26:23.748720  451832 buffer_comparator.cc:145] Difference at 265: 34.4225, expected 30.5112
E0611 14:26:23.748724  451832 buffer_comparator.cc:145] Difference at 267: 40.5426, expected 36.3429
E0611 14:26:23.748729  451832 buffer_comparator.cc:145] Difference at 268: 35.4782, expected 30.2271
E0611 14:26:23.748736  451832 buffer_comparator.cc:145] Difference at 269: 34.3253, expected 30.6183
2025-06-11 14:26:23.748744: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.757706  451832 buffer_comparator.cc:145] Difference at 1025: 18.6348, expected 16.5961
E0611 14:26:23.757719  451832 buffer_comparator.cc:145] Difference at 1026: 15.6877, expected 13.5831
E0611 14:26:23.757724  451832 buffer_comparator.cc:145] Difference at 1038: 17.7971, expected 14.9409
E0611 14:26:23.757730  451832 buffer_comparator.cc:145] Difference at 1041: 18.4966, expected 16.4885
E0611 14:26:23.757736  451832 buffer_comparator.cc:145] Difference at 1043: 16.4748, expected 14.5351
E0611 14:26:23.757743  451832 buffer_comparator.cc:145] Difference at 1045: 15.9621, expected 14.029
E0611 14:26:23.757748  451832 buffer_comparator.cc:145] Difference at 1048: 16.9931, expected 14.4038
E0611 14:26:23.757755  451832 buffer_comparator.cc:145] Difference at 1050: 19.0769, expected 16.5237
E0611 14:26:23.757761  451832 buffer_comparator.cc:145] Difference at 1057: 14.1745, expected 12.4159
E0611 14:26:23.757766  451832 buffer_comparator.cc:145] Difference at 1064: 15.1762, expected 13.3455
2025-06-11 14:26:23.757774: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.765509  451832 buffer_comparator.cc:145] Difference at 384: 70.1718, expected 61.9522
E0611 14:26:23.765522  451832 buffer_comparator.cc:145] Difference at 394: 69.2255, expected 60.4526
E0611 14:26:23.765527  451832 buffer_comparator.cc:145] Difference at 397: 64.7409, expected 57.3438
E0611 14:26:23.765535  451832 buffer_comparator.cc:145] Difference at 404: 66.496, expected 59.6846
E0611 14:26:23.765540  451832 buffer_comparator.cc:145] Difference at 406: 67.2019, expected 59.6928
E0611 14:26:23.765547  451832 buffer_comparator.cc:145] Difference at 660: 71.3602, expected 63.6354
E0611 14:26:23.765554  451832 buffer_comparator.cc:145] Difference at 718: 64.1682, expected 71.8248
E0611 14:26:23.765563  451832 buffer_comparator.cc:145] Difference at 1691: 67.0263, expected 59.7899
E0611 14:26:23.765569  451832 buffer_comparator.cc:145] Difference at 1764: 59.8937, expected 67.8645
E0611 14:26:23.765576  451832 buffer_comparator.cc:145] Difference at 1765: 63.9905, expected 72.8709
2025-06-11 14:26:23.765583: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.769310  451832 buffer_comparator.cc:145] Difference at 384: 70.1718, expected 61.9522
E0611 14:26:23.769322  451832 buffer_comparator.cc:145] Difference at 394: 69.2255, expected 60.4526
E0611 14:26:23.769327  451832 buffer_comparator.cc:145] Difference at 397: 64.7409, expected 57.3438
E0611 14:26:23.769334  451832 buffer_comparator.cc:145] Difference at 404: 66.496, expected 59.6846
E0611 14:26:23.769339  451832 buffer_comparator.cc:145] Difference at 406: 67.2019, expected 59.6928
E0611 14:26:23.769347  451832 buffer_comparator.cc:145] Difference at 660: 71.3602, expected 63.6354
E0611 14:26:23.769353  451832 buffer_comparator.cc:145] Difference at 718: 64.1682, expected 71.8248
E0611 14:26:23.769363  451832 buffer_comparator.cc:145] Difference at 1691: 67.0263, expected 59.7899
E0611 14:26:23.769368  451832 buffer_comparator.cc:145] Difference at 1764: 59.8937, expected 67.8645
E0611 14:26:23.769373  451832 buffer_comparator.cc:145] Difference at 1765: 63.9906, expected 72.8709
2025-06-11 14:26:23.769380: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.805237  451832 buffer_comparator.cc:145] Difference at 1024: 9.26716, expected 7.62103
E0611 14:26:23.805249  451832 buffer_comparator.cc:145] Difference at 1025: 9.144, expected 8.11683
E0611 14:26:23.805254  451832 buffer_comparator.cc:145] Difference at 1029: 11.8019, expected 10.4442
E0611 14:26:23.805261  451832 buffer_comparator.cc:145] Difference at 1030: 9.69273, expected 7.69553
E0611 14:26:23.805268  451832 buffer_comparator.cc:145] Difference at 1031: 11.2271, expected 9.57831
E0611 14:26:23.805273  451832 buffer_comparator.cc:145] Difference at 1032: 10.0358, expected 8.47069
E0611 14:26:23.805279  451832 buffer_comparator.cc:145] Difference at 1034: 9.16138, expected 7.68747
E0611 14:26:23.805286  451832 buffer_comparator.cc:145] Difference at 1035: 10.7463, expected 9.28415
E0611 14:26:23.805291  451832 buffer_comparator.cc:145] Difference at 1037: 9.79935, expected 8.46282
E0611 14:26:23.805296  451832 buffer_comparator.cc:145] Difference at 1039: 9.90779, expected 8.7388
2025-06-11 14:26:23.805303: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
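
To put a number on how far off these values are, a small script along the following lines can parse the `buffer_comparator` lines and compute the relative error for each reported element. This is only a sketch: the function name `parse_mismatches` and the regular expression are my own illustrative choices, not part of XLA or JAX, and the sample text is just the first three lines of the log above.

```python
import re

# Matches the payload of lines such as:
#   E0611 ... buffer_comparator.cc:145] Difference at 9344: 205.843, expected 183.143
# The pattern is an assumption based on the log format shown in this issue.
DIFF_RE = re.compile(r"Difference at (\d+): (\S+), expected (\S+)")


def parse_mismatches(log_text):
    """Return a list of (index, actual, expected, relative_error) tuples."""
    rows = []
    for match in DIFF_RE.finditer(log_text):
        index = int(match.group(1))
        actual = float(match.group(2))
        expected = float(match.group(3))
        # Guard against division by zero when the reference value is 0.
        if expected != 0:
            rel_err = abs(actual - expected) / abs(expected)
        else:
            rel_err = float("inf")
        rows.append((index, actual, expected, rel_err))
    return rows


# First three mismatch lines copied from the log in this issue.
SAMPLE = """\
E0611 14:26:23.597251  451832 buffer_comparator.cc:145] Difference at 9344: 205.843, expected 183.143
E0611 14:26:23.597299  451832 buffer_comparator.cc:145] Difference at 15892: 210.321, expected 187.659
E0611 14:26:23.597305  451832 buffer_comparator.cc:145] Difference at 15938: 199.099, expected 178.016
"""

if __name__ == "__main__":
    for index, actual, expected, rel_err in parse_mismatches(SAMPLE):
        print(f"index {index}: actual={actual}, expected={expected}, "
              f"relative error={rel_err:.1%}")
```

For these three sample lines the relative error comes out at roughly 12%, which looks far larger than ordinary floating-point rounding noise.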

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.6.0.dev20250606+4952ad41a
jaxlib: 0.6.0.dev20250609
numpy:  2.3.0
python: 3.11.9 (main, Apr  3 2025, 09:59:09) [GCC 14.2.0]
device info: AMD Instinct MI210-3, 3 local devices"
process_count: 1
platform: uname_result(system='Linux', node='gpunode', release='5.14.0-503.33.1.el9_5.x86_64', version='#1 SMP PREEMPT_DYNAMIC Wed Mar 19 16:23:31 UTC 2025', machine='x86_64')

I compiled the library with ROCm 6.3.3.
