Description
Hi,
I am working on a deep learning project for which I need to train a model on an AMD GPU cluster. I use JAX, Flax and Optax as the main machine learning stack. On my personal computer, which has an NVIDIA GPU, the Python code runs without errors. When I run the same code on an AMD GPU on the cluster, however, XLA's GEMM fusion autotuner reports that its results do not match the reference.
I compiled the rocm-jax library from source at the latest tag (v0.6.0) and installed the package in a Python venv. I also tried the official rocm/jax-community Docker image (via Apptainer), but I get the same messages.
The issue is not specific to my code: the same messages appear when I run one of Flax's example scripts (specifically https://github.com/google/flax/blob/main/examples/nnx_toy_examples/05_vae.py). Running that script, I get:
E0611 14:26:23.597251 451832 buffer_comparator.cc:145] Difference at 9344: 205.843, expected 183.143
E0611 14:26:23.597299 451832 buffer_comparator.cc:145] Difference at 15892: 210.321, expected 187.659
E0611 14:26:23.597305 451832 buffer_comparator.cc:145] Difference at 15938: 199.099, expected 178.016
2025-06-11 14:26:23.597314: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.669495 451832 buffer_comparator.cc:145] Difference at 7792: 185.703, expected 208.131
E0611 14:26:23.669535 451832 buffer_comparator.cc:145] Difference at 12102: 206.548, expected 183.614
E0611 14:26:23.669549 451832 buffer_comparator.cc:145] Difference at 15948: 218.747, expected 194.395
2025-06-11 14:26:23.669557: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.682719 451832 buffer_comparator.cc:145] Difference at 1025: 10.3105, expected 8.15896
E0611 14:26:23.682732 451832 buffer_comparator.cc:145] Difference at 1029: 9.24706, expected 7.75471
E0611 14:26:23.682736 451832 buffer_comparator.cc:145] Difference at 1031: 10.8553, expected 9.51487
E0611 14:26:23.682739 451832 buffer_comparator.cc:145] Difference at 1032: 8.61683, expected 7.628
E0611 14:26:23.682743 451832 buffer_comparator.cc:145] Difference at 1035: 10.0826, expected 8.13794
E0611 14:26:23.682747 451832 buffer_comparator.cc:145] Difference at 1036: 9.36097, expected 8.08674
E0611 14:26:23.682750 451832 buffer_comparator.cc:145] Difference at 1038: 7.40636, expected 5.49972
E0611 14:26:23.682755 451832 buffer_comparator.cc:145] Difference at 1041: 9.11373, expected 8.06417
E0611 14:26:23.682758 451832 buffer_comparator.cc:145] Difference at 1044: 9.96564, expected 6.78432
E0611 14:26:23.682761 451832 buffer_comparator.cc:145] Difference at 1047: 10.3575, expected 8.61353
2025-06-11 14:26:23.682768: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.739910 451832 buffer_comparator.cc:145] Difference at 3137: 40.0469, expected 34.1624
E0611 14:26:23.739965 451832 buffer_comparator.cc:145] Difference at 3138: 34.1857, expected 29.3643
E0611 14:26:23.739970 451832 buffer_comparator.cc:145] Difference at 3139: 33.6209, expected 29.6085
E0611 14:26:23.739975 451832 buffer_comparator.cc:145] Difference at 3140: 33.4683, expected 29.7014
E0611 14:26:23.739982 451832 buffer_comparator.cc:145] Difference at 3143: 36.3527, expected 31.76
E0611 14:26:23.739986 451832 buffer_comparator.cc:145] Difference at 3144: 36.3201, expected 31.9039
E0611 14:26:23.740009 451832 buffer_comparator.cc:145] Difference at 3145: 37.0308, expected 32.787
E0611 14:26:23.740015 451832 buffer_comparator.cc:145] Difference at 3146: 33.6818, expected 29.4154
E0611 14:26:23.740021 451832 buffer_comparator.cc:145] Difference at 3150: 41.9207, expected 37.5318
E0611 14:26:23.740027 451832 buffer_comparator.cc:145] Difference at 3151: 33.3773, expected 28.8064
2025-06-11 14:26:23.740039: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.748676 451832 buffer_comparator.cc:145] Difference at 256: 33.6314, expected 29.6305
E0611 14:26:23.748690 451832 buffer_comparator.cc:145] Difference at 257: 35.9533, expected 31.5235
E0611 14:26:23.748695 451832 buffer_comparator.cc:145] Difference at 258: 34.0687, expected 29.0571
E0611 14:26:23.748702 451832 buffer_comparator.cc:145] Difference at 260: 34.1512, expected 30.1305
E0611 14:26:23.748709 451832 buffer_comparator.cc:145] Difference at 261: 35.5173, expected 31.2248
E0611 14:26:23.748714 451832 buffer_comparator.cc:145] Difference at 263: 37.6513, expected 33.6381
E0611 14:26:23.748720 451832 buffer_comparator.cc:145] Difference at 265: 34.4225, expected 30.5112
E0611 14:26:23.748724 451832 buffer_comparator.cc:145] Difference at 267: 40.5426, expected 36.3429
E0611 14:26:23.748729 451832 buffer_comparator.cc:145] Difference at 268: 35.4782, expected 30.2271
E0611 14:26:23.748736 451832 buffer_comparator.cc:145] Difference at 269: 34.3253, expected 30.6183
2025-06-11 14:26:23.748744: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.757706 451832 buffer_comparator.cc:145] Difference at 1025: 18.6348, expected 16.5961
E0611 14:26:23.757719 451832 buffer_comparator.cc:145] Difference at 1026: 15.6877, expected 13.5831
E0611 14:26:23.757724 451832 buffer_comparator.cc:145] Difference at 1038: 17.7971, expected 14.9409
E0611 14:26:23.757730 451832 buffer_comparator.cc:145] Difference at 1041: 18.4966, expected 16.4885
E0611 14:26:23.757736 451832 buffer_comparator.cc:145] Difference at 1043: 16.4748, expected 14.5351
E0611 14:26:23.757743 451832 buffer_comparator.cc:145] Difference at 1045: 15.9621, expected 14.029
E0611 14:26:23.757748 451832 buffer_comparator.cc:145] Difference at 1048: 16.9931, expected 14.4038
E0611 14:26:23.757755 451832 buffer_comparator.cc:145] Difference at 1050: 19.0769, expected 16.5237
E0611 14:26:23.757761 451832 buffer_comparator.cc:145] Difference at 1057: 14.1745, expected 12.4159
E0611 14:26:23.757766 451832 buffer_comparator.cc:145] Difference at 1064: 15.1762, expected 13.3455
2025-06-11 14:26:23.757774: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.765509 451832 buffer_comparator.cc:145] Difference at 384: 70.1718, expected 61.9522
E0611 14:26:23.765522 451832 buffer_comparator.cc:145] Difference at 394: 69.2255, expected 60.4526
E0611 14:26:23.765527 451832 buffer_comparator.cc:145] Difference at 397: 64.7409, expected 57.3438
E0611 14:26:23.765535 451832 buffer_comparator.cc:145] Difference at 404: 66.496, expected 59.6846
E0611 14:26:23.765540 451832 buffer_comparator.cc:145] Difference at 406: 67.2019, expected 59.6928
E0611 14:26:23.765547 451832 buffer_comparator.cc:145] Difference at 660: 71.3602, expected 63.6354
E0611 14:26:23.765554 451832 buffer_comparator.cc:145] Difference at 718: 64.1682, expected 71.8248
E0611 14:26:23.765563 451832 buffer_comparator.cc:145] Difference at 1691: 67.0263, expected 59.7899
E0611 14:26:23.765569 451832 buffer_comparator.cc:145] Difference at 1764: 59.8937, expected 67.8645
E0611 14:26:23.765576 451832 buffer_comparator.cc:145] Difference at 1765: 63.9905, expected 72.8709
2025-06-11 14:26:23.765583: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.769310 451832 buffer_comparator.cc:145] Difference at 384: 70.1718, expected 61.9522
E0611 14:26:23.769322 451832 buffer_comparator.cc:145] Difference at 394: 69.2255, expected 60.4526
E0611 14:26:23.769327 451832 buffer_comparator.cc:145] Difference at 397: 64.7409, expected 57.3438
E0611 14:26:23.769334 451832 buffer_comparator.cc:145] Difference at 404: 66.496, expected 59.6846
E0611 14:26:23.769339 451832 buffer_comparator.cc:145] Difference at 406: 67.2019, expected 59.6928
E0611 14:26:23.769347 451832 buffer_comparator.cc:145] Difference at 660: 71.3602, expected 63.6354
E0611 14:26:23.769353 451832 buffer_comparator.cc:145] Difference at 718: 64.1682, expected 71.8248
E0611 14:26:23.769363 451832 buffer_comparator.cc:145] Difference at 1691: 67.0263, expected 59.7899
E0611 14:26:23.769368 451832 buffer_comparator.cc:145] Difference at 1764: 59.8937, expected 67.8645
E0611 14:26:23.769373 451832 buffer_comparator.cc:145] Difference at 1765: 63.9906, expected 72.8709
2025-06-11 14:26:23.769380: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0611 14:26:23.805237 451832 buffer_comparator.cc:145] Difference at 1024: 9.26716, expected 7.62103
E0611 14:26:23.805249 451832 buffer_comparator.cc:145] Difference at 1025: 9.144, expected 8.11683
E0611 14:26:23.805254 451832 buffer_comparator.cc:145] Difference at 1029: 11.8019, expected 10.4442
E0611 14:26:23.805261 451832 buffer_comparator.cc:145] Difference at 1030: 9.69273, expected 7.69553
E0611 14:26:23.805268 451832 buffer_comparator.cc:145] Difference at 1031: 11.2271, expected 9.57831
E0611 14:26:23.805273 451832 buffer_comparator.cc:145] Difference at 1032: 10.0358, expected 8.47069
E0611 14:26:23.805279 451832 buffer_comparator.cc:145] Difference at 1034: 9.16138, expected 7.68747
E0611 14:26:23.805286 451832 buffer_comparator.cc:145] Difference at 1035: 10.7463, expected 9.28415
E0611 14:26:23.805291 451832 buffer_comparator.cc:145] Difference at 1037: 9.79935, expected 8.46282
E0611 14:26:23.805296 451832 buffer_comparator.cc:145] Difference at 1039: 9.90779, expected 8.7388
2025-06-11 14:26:23.805303: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1165] Results do not match the reference. This is likely a bug/unexpected loss of precision.
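To put a number on the mismatches, here is a minimal sketch that parses the `buffer_comparator.cc` lines and computes the relative error of each reported element. The function name `relative_errors` and the regular expression are my own illustration, written against the line format shown in the log above. They are not part of XLA.

```python
import re

# Matches the tail of lines such as:
#   "... buffer_comparator.cc:145] Difference at 9344: 205.843, expected 183.143"
PATTERN = re.compile(r"Difference at (\d+): (\S+), expected (\S+)")


def relative_errors(log_text):
    """Return a list of (index, got, expected, relative_error) tuples."""
    rows = []
    for match in PATTERN.finditer(log_text):
        index = int(match.group(1))
        got = float(match.group(2))
        expected = float(match.group(3))
        # Guard against division by zero when the reference value is 0.
        if expected == 0.0:
            rel = float("inf")
        else:
            rel = abs(got - expected) / abs(expected)
        rows.append((index, got, expected, rel))
    return rows


# Three lines copied from the log above.
SAMPLE = """\
E0611 14:26:23.597251 451832 buffer_comparator.cc:145] Difference at 9344: 205.843, expected 183.143
E0611 14:26:23.597299 451832 buffer_comparator.cc:145] Difference at 15892: 210.321, expected 187.659
E0611 14:26:23.597305 451832 buffer_comparator.cc:145] Difference at 15938: 199.099, expected 178.016
"""

if __name__ == "__main__":
    for index, got, expected, rel in relative_errors(SAMPLE):
        print(f"index {index}: got {got}, expected {expected}, relative error {rel:.3f}")
```

For the three sampled lines the relative error is above 10%, which is far larger than ordinary float32 rounding differences.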
System info (python version, jaxlib version, accelerator, etc.)
jax: 0.6.0.dev20250606+4952ad41a
jaxlib: 0.6.0.dev20250609
numpy: 2.3.0
python: 3.11.9 (main, Apr 3 2025, 09:59:09) [GCC 14.2.0]
device info: AMD Instinct MI210-3, 3 local devices
process_count: 1
platform: uname_result(system='Linux', node='gpunode', release='5.14.0-503.33.1.el9_5.x86_64', version='#1 SMP PREEMPT_DYNAMIC Wed Mar 19 16:23:31 UTC 2025', machine='x86_64')
I compiled the library with ROCm 6.3.3.
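The log messages come from the autotuner comparing candidate kernels, so they do not by themselves show whether the final results are wrong. A minimal sketch of a check, assuming a plain float32 matrix product is representative of the affected GEMMs: compare the result from the default JAX device against a float64 NumPy reference. The function name `max_relative_error`, the matrix shapes and the seed are arbitrary choices for illustration.

```python
import numpy as np
import jax
import jax.numpy as jnp


def max_relative_error(m=256, k=512, n=128, seed=0):
    """Largest absolute error of a float32 matmul on the default JAX device,
    normalised by the largest magnitude in a float64 NumPy reference."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, k)).astype(np.float32)
    b = rng.standard_normal((k, n)).astype(np.float32)

    # Reference computed on the host in float64.
    reference = a.astype(np.float64) @ b.astype(np.float64)

    # Same product through JAX, requesting the highest available precision
    # so that reduced-precision matmul modes do not dominate the error.
    result = jnp.matmul(a, b, precision=jax.lax.Precision.HIGHEST)
    result = np.asarray(result, dtype=np.float64)

    return float(np.max(np.abs(result - reference)) / np.max(np.abs(reference)))


if __name__ == "__main__":
    print("devices:", jax.devices())
    print("max relative error:", max_relative_error())
```

A value near float32 rounding level would suggest the final kernels are fine and only some autotuning candidates are rejected. A value near the 10% seen in the log would point to a real correctness problem.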