[ROCm][CI] Fixes tests for pytorch nightly and python only builds #28979
base: main
Conversation
- Fixed setup.py logging format (G001; see the sketch below)
- Added ROCm support for precompiled wheels
- Improved the Dockerfile.rocm test stage with hf_transfer and the v1 package
- Fixed pytorch_nightly_dependency.sh to detect the ROCm version correctly
- Added a source code directory for the python_only_compile.sh test

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
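For context on the G001 item above: ruff's G001 rule flags logging calls that pre-format their message with `str.format()` instead of passing lazy arguments. A minimal illustration of the preferred form is below; the variable name and message are placeholders, not the actual setup.py call sites.

```python
# Illustrative only: G001 prefers lazy %-style logging arguments over
# pre-formatted strings, so formatting is skipped when the level is disabled.
import logging

logger = logging.getLogger(__name__)

wheel_location = "https://wheels.vllm.ai/nightly"  # placeholder value

# Flagged by G001:
#     logger.info("Using precompiled wheel from {}".format(wheel_location))
# Preferred form:
logger.info("Using precompiled wheel from %s", wheel_location)
```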
Code Review
This pull request introduces fixes for PyTorch nightly and Python-only build tests on the ROCm platform, and adds torchaudio to the ROCm Docker image. The changes in setup.py and the test scripts are well-implemented, improving the robustness of the build process. My review focuses on optimizing the Dockerfiles. I've suggested combining several RUN instructions to reduce image layers, which is a best practice that improves build times and reduces image size. I've also recommended adding apt-get update and cache cleanup to an apt-get install command to ensure correctness and efficiency.
docker/Dockerfile.rocm
Outdated
RUN mkdir src
RUN mv vllm src/vllm
docker/Dockerfile.rocm_base
Outdated
    && python3 --version && python3 -m pip --version

RUN pip install -U packaging 'cmake<4' ninja wheel 'setuptools<80' pybind11 Cython
RUN apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox
For correctness and to keep the Docker image size minimal, it's crucial to run apt-get update before apt-get install and to clean up the apt cache in the same RUN layer. This prevents issues with stale package lists and removes unnecessary files.
RUN apt-get update && apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox && rm -rf /var/lib/apt/lists/*
RUN git clone ${PYTORCH_AUDIO_REPO} audio
RUN cd audio && git checkout ${PYTORCH_AUDIO_BRANCH} \
    && git submodule update --init --recursive \
    && pip install -r requirements.txt \
    && python3 setup.py bdist_wheel --dist-dir=dist \
    && pip install dist/*.whl
To reduce the number of Docker image layers and improve build efficiency, it's recommended to combine the git clone and the subsequent build commands for torchaudio into a single RUN instruction.
RUN git clone ${PYTORCH_AUDIO_REPO} audio && cd audio \
&& git checkout ${PYTORCH_AUDIO_BRANCH} \
&& git submodule update --init --recursive \
&& pip install -r requirements.txt \
&& python3 setup.py bdist_wheel --dist-dir=dist \
&& pip install dist/*.whl
💡 Codex Review
Here are some automated review suggestions for this pull request.
setup.py
Outdated
 # If using precompiled, extract and patch package_data (in advance of setup)
 if envs.VLLM_USE_PRECOMPILED:
-    assert _is_cuda(), "VLLM_USE_PRECOMPILED is only supported for CUDA builds"
+    assert _is_cuda() or _is_hip(), (
+        "VLLM_USE_PRECOMPILED is only supported for CUDA or ROCm builds."
Allowing precompiled mode on ROCm installs CUDA wheel
The precompiled path now explicitly allows _is_hip() (ROCm) builds, but the precompiled wheel download below still always targets the CUDA binary (wheels.vllm.ai/...vllm-1.0.0.dev-cp38-abi3-<arch>.whl, no ROCm variant). On ROCm agents where VLLM_USE_PRECOMPILED=1 is used (e.g., python_only_compile.sh in the AMD pipeline), this will install CUDA .so files without the CUDA runtime present, causing import-time failures instead of a clean skip. Either keep the CUDA-only guard or fetch a ROCm-specific wheel before enabling this path on ROCm.
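One possible direction, sketched below: keep the CUDA wheel download as-is and fail fast on ROCm until a ROCm wheel is actually published. Only `_is_cuda()` / `_is_hip()` come from the diff above; the stand-in helper bodies, messages, and environment variables they read are illustrative, not vLLM's actual setup.py code.

```python
# Sketch only: gate the precompiled path per platform so the CUDA wheel is
# never installed on a ROCm-only host. Helper bodies are stand-ins.
import os


def _is_cuda() -> bool:
    # Stand-in for setup.py's real detection helper.
    return os.environ.get("VLLM_TARGET_DEVICE", "cuda") == "cuda"


def _is_hip() -> bool:
    # Stand-in for setup.py's real detection helper.
    return os.environ.get("VLLM_TARGET_DEVICE", "") == "rocm"


def check_precompiled_target() -> None:
    """Fail fast instead of downloading the CUDA-only precompiled wheel on ROCm."""
    if not os.environ.get("VLLM_USE_PRECOMPILED"):
        return
    if _is_cuda():
        return  # the published wheel on wheels.vllm.ai targets CUDA
    if _is_hip():
        raise RuntimeError(
            "VLLM_USE_PRECOMPILED has no ROCm wheel yet; build from source instead."
        )
    raise RuntimeError("VLLM_USE_PRECOMPILED requires a CUDA or ROCm build.")
```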
docker/Dockerfile.rocm_base
Outdated
    pip install -r requirements.txt && git submodule update --init --recursive \
ARG PYTORCH_AUDIO_REPO

RUN git clone ${PYTORCH_REPO} pytorch && cd pytorch \
Please don't combine these steps.
It won't affect the final image size because of the multi-stage build process, but keeping them separate allows the git clone layer to be cached when you're debugging torch build issues.
Thanks for pointing this out. Let me know if the modifications are now in line with the above feedback.
docker/Dockerfile.rocm
Outdated
# -----------------------
# Test vLLM image
FROM base AS test
ARG PYTHON_VERSION=3.12
This is already being set in the base Docker stage. A better way would be to set it as an ENV there and inherit it in this image.
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Copy in the v1 package for testing (it isn't distributed yet)
COPY vllm/v1 /usr/local/lib/python${PYTHON_VERSION}/dist-packages/vllm/v1
Looks kinda hacky; why do we need it for tests but not for the normal distribution?
SERVER_ARGS = ["--enforce-eager"]


@pytest.fixture(scope="module", autouse=True)
Could this be put in conftest.py?
Thanks for the review. I moved this into conftest. Let me know if it looks better now.
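Roughly what the shared fixture looks like once hoisted into conftest.py, as a sketch only: the model name is a placeholder, and the `RemoteOpenAIServer` import path and signature are assumed from vLLM's test utilities rather than copied from this PR's diff.

```python
# conftest.py (sketch): start one API server per test module and reuse it
# across all tests in that module.
import pytest

from tests.utils import RemoteOpenAIServer  # import path assumed

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder model
SERVER_ARGS = ["--enforce-eager"]


@pytest.fixture(scope="module", autouse=True)
def server():
    # The context manager launches the server and tears it down after the module.
    with RemoteOpenAIServer(MODEL_NAME, SERVER_ARGS) as remote_server:
        yield remote_server
```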
Added support for …
Added FLEX_ATTENTION back since it was an integral part of some encoder-only self-attention tests under the entrypoints test group. Also updated terratorch to version 1.1.1 because the current version gave this error:

Click to expand full log
```log
FAILED entrypoints/openai/test_vision_embeds.py::test_single_request[ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11]
self = _ModelRegistry(models={'AfmoeForCausalLM': LazyRegisteredModel(module_name='vllm.model_executor.models.afmoe', class...(module_name='vllm.model_executor.models.transformers', class_name='TransformersMultiModalForSequenceClassification')})
E AttributeError: 'NoneType' object has no attribute 'is_text_generation_model'
```
is_eager_execution = compilation_config == CUDAGraphMode.NONE

use_aiter_rms_norm = rocm_aiter_ops.is_rmsnorm_enabled()
Also added logic to automatically set the block size to 64 on ROCm when the attention backend is AITER Unified Attention, since it yields significant performance gains.
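A minimal sketch of that auto-tuning logic, under assumptions: the `is_rocm` helper, the backend name string, and the config shape are illustrative stand-ins, not the PR's actual hook into vLLM's platform/config layer.

```python
# Sketch: prefer a KV-cache block size of 64 on ROCm when the AITER Unified
# Attention backend is selected, leaving any explicit user choice untouched.
from dataclasses import dataclass


@dataclass
class CacheConfig:
    block_size: int | None = None  # None means "not set by the user"


def is_rocm() -> bool:
    # Stand-in for the real platform check (e.g. torch.version.hip is not None).
    return False


def maybe_override_block_size(cache_config: CacheConfig, attn_backend: str) -> None:
    if (
        is_rocm()
        and attn_backend == "ROCM_AITER_UNIFIED_ATTN"  # assumed backend name
        and cache_config.block_size is None
    ):
        cache_config.block_size = 64  # measured to perform best with this backend
```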
…or terratorch to follow upstream
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This PR fixes tests for the following labels:
- Pytorch Nightly Dependency Override Check
- Python-only Installation Test

It also adds the torchaudio package to the Dockerfile.