[ROCm][CI] Fixes tests for pytorch nightly and python only builds #28979
base: main
Conversation
- Fixed setup.py logging format (G001; see the sketch below)
- Added ROCm support for precompiled wheels
- Improved the Dockerfile.rocm test stage with hf_transfer and the v1 package
- Fixed pytorch_nightly_dependency.sh to detect the ROCm version correctly
- Added a source code directory for the python_only_compile.sh test

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
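For context on the G001 item above: ruff's G001 rule flags logging calls that pre-format their message with `str.format()` instead of passing lazy arguments. A minimal illustration of the preferred form is below; the variable name and message are placeholders, not the actual setup.py call sites.

```python
# Illustrative only: G001 prefers lazy %-style logging arguments over
# pre-formatted strings, so formatting is skipped when the level is disabled.
import logging

logger = logging.getLogger(__name__)

wheel_location = "https://wheels.vllm.ai/nightly"  # placeholder value

# Flagged by G001:
#     logger.info("Using precompiled wheel from {}".format(wheel_location))
# Preferred form:
logger.info("Using precompiled wheel from %s", wheel_location)
```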
Code Review
This pull request introduces fixes for PyTorch nightly and Python-only build tests on the ROCm platform, and adds torchaudio to the ROCm Docker image. The changes in setup.py and the test scripts are well-implemented, improving the robustness of the build process. My review focuses on optimizing the Dockerfiles. I've suggested combining several RUN instructions to reduce image layers, which is a best practice that improves build times and reduces image size. I've also recommended adding apt-get update and cache cleanup to an apt-get install command to ensure correctness and efficiency.
docker/Dockerfile.rocm
Outdated
RUN mkdir src
RUN mv vllm src/vllm
docker/Dockerfile.rocm_base
Outdated
    && python3 --version && python3 -m pip --version

RUN pip install -U packaging 'cmake<4' ninja wheel 'setuptools<80' pybind11 Cython
RUN apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox
For correctness and to keep the Docker image size minimal, it's crucial to run apt-get update before apt-get install and to clean up the apt cache in the same RUN layer. This prevents issues with stale package lists and removes unnecessary files.
RUN apt-get update && apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox && rm -rf /var/lib/apt/lists/*
RUN git clone ${PYTORCH_AUDIO_REPO} audio
RUN cd audio && git checkout ${PYTORCH_AUDIO_BRANCH} \
    && git submodule update --init --recursive \
    && pip install -r requirements.txt \
    && python3 setup.py bdist_wheel --dist-dir=dist \
    && pip install dist/*.whl
To reduce the number of Docker image layers and improve build efficiency, it's recommended to combine the git clone and the subsequent build commands for torchaudio into a single RUN instruction.
RUN git clone ${PYTORCH_AUDIO_REPO} audio && cd audio \
&& git checkout ${PYTORCH_AUDIO_BRANCH} \
&& git submodule update --init --recursive \
&& pip install -r requirements.txt \
&& python3 setup.py bdist_wheel --dist-dir=dist \
&& pip install dist/*.whl
💡 Codex Review
Here are some automated review suggestions for this pull request.
setup.py
Outdated
 # If using precompiled, extract and patch package_data (in advance of setup)
 if envs.VLLM_USE_PRECOMPILED:
-    assert _is_cuda(), "VLLM_USE_PRECOMPILED is only supported for CUDA builds"
+    assert _is_cuda() or _is_hip(), (
+        "VLLM_USE_PRECOMPILED is only supported for CUDA or ROCm builds."
Allowing precompiled mode on ROCm installs CUDA wheel
The precompiled path now explicitly allows _is_hip() (ROCm) builds, but the precompiled wheel download below still always targets the CUDA binary (wheels.vllm.ai/...vllm-1.0.0.dev-cp38-abi3-<arch>.whl, no ROCm variant). On ROCm agents where VLLM_USE_PRECOMPILED=1 is used (e.g., python_only_compile.sh in the AMD pipeline), this will install CUDA .so files without the CUDA runtime present, causing import-time failures instead of a clean skip. Either keep the CUDA-only guard or fetch a ROCm-specific wheel before enabling this path on ROCm.
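One possible direction, sketched below: keep the CUDA wheel download as-is and fail fast on ROCm until a ROCm wheel is actually published. Only `_is_cuda()` / `_is_hip()` come from the diff above; the stand-in helper bodies, messages, and environment variables they read are illustrative, not vLLM's actual setup.py code.

```python
# Sketch only: gate the precompiled path per platform so the CUDA wheel is
# never installed on a ROCm-only host. Helper bodies are stand-ins.
import os


def _is_cuda() -> bool:
    # Stand-in for setup.py's real detection helper.
    return os.environ.get("VLLM_TARGET_DEVICE", "cuda") == "cuda"


def _is_hip() -> bool:
    # Stand-in for setup.py's real detection helper.
    return os.environ.get("VLLM_TARGET_DEVICE", "") == "rocm"


def check_precompiled_target() -> None:
    """Fail fast instead of downloading the CUDA-only precompiled wheel on ROCm."""
    if not os.environ.get("VLLM_USE_PRECOMPILED"):
        return
    if _is_cuda():
        return  # the published wheel on wheels.vllm.ai targets CUDA
    if _is_hip():
        raise RuntimeError(
            "VLLM_USE_PRECOMPILED has no ROCm wheel yet; build from source instead."
        )
    raise RuntimeError("VLLM_USE_PRECOMPILED requires a CUDA or ROCm build.")
```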
docker/Dockerfile.rocm_base
Outdated
    pip install -r requirements.txt && git submodule update --init --recursive \
ARG PYTORCH_AUDIO_REPO

RUN git clone ${PYTORCH_REPO} pytorch && cd pytorch \
Please don't combine these steps.
It won't affect the final image size because of the multi-stage build process, but keeping them separate allows the git clone layer to be cached when you're debugging torch build issues.
Thanks for pointing this out. Let me know if the modifications are now in line with the above feedback.
docker/Dockerfile.rocm
Outdated
# -----------------------
# Test vLLM image
FROM base AS test
ARG PYTHON_VERSION=3.12
This is already being set in the base Docker stage. A better way would be to set it as an ENV there and inherit it in this image.
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Copy in the v1 package for testing (it isn't distributed yet)
COPY vllm/v1 /usr/local/lib/python${PYTHON_VERSION}/dist-packages/vllm/v1
Looks kinda hacky; why do we need it for tests but not for the normal distribution?
SERVER_ARGS = ["--enforce-eager"]


@pytest.fixture(scope="module", autouse=True)
Could this be put in conftest.py?
Thanks for the review. I moved this into conftest. Let me know if it looks better now.
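Roughly what the shared fixture looks like once hoisted into conftest.py, as a sketch only: the model name is a placeholder, and the `RemoteOpenAIServer` import path and signature are assumed from vLLM's test utilities rather than copied from this PR's diff.

```python
# conftest.py (sketch): start one API server per test module and reuse it
# across all tests in that module.
import pytest

from tests.utils import RemoteOpenAIServer  # import path assumed

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder model
SERVER_ARGS = ["--enforce-eager"]


@pytest.fixture(scope="module", autouse=True)
def server():
    # The context manager launches the server and tears it down after the module.
    with RemoteOpenAIServer(MODEL_NAME, SERVER_ARGS) as remote_server:
        yield remote_server
```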
Added support for …
Added FLEX_ATTENTION back since it was an integral part of some encoder-only self-attention tests under the entrypoints test group. Also updated terratorch to version 1.1.1 because the current version gave this error:

Click to expand full log
```log
FAILED entrypoints/openai/test_vision_embeds.py::test_single_request[ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11]
self = _ModelRegistry(models={'AfmoeForCausalLM': LazyRegisteredModel(module_name='vllm.model_executor.models.afmoe', class...(module_name='vllm.model_executor.models.transformers', class_name='TransformersMultiModalForSequenceClassification')})
E AttributeError: 'NoneType' object has no attribute 'is_text_generation_model'
```
is_eager_execution = compilation_config == CUDAGraphMode.NONE

use_aiter_rms_norm = rocm_aiter_ops.is_rmsnorm_enabled()
Also added logic to automatically set the block size to 64 on ROCm when the attention backend is AITER Unified Attention, since it yields significant performance gains.
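A minimal sketch of that auto-tuning logic, under assumptions: the `is_rocm` helper, the backend name string, and the config shape are illustrative stand-ins, not the PR's actual hook into vLLM's platform/config layer.

```python
# Sketch: prefer a KV-cache block size of 64 on ROCm when the AITER Unified
# Attention backend is selected, leaving any explicit user choice untouched.
from dataclasses import dataclass


@dataclass
class CacheConfig:
    block_size: int | None = None  # None means "not set by the user"


def is_rocm() -> bool:
    # Stand-in for the real platform check (e.g. torch.version.hip is not None).
    return False


def maybe_override_block_size(cache_config: CacheConfig, attn_backend: str) -> None:
    if (
        is_rocm()
        and attn_backend == "ROCM_AITER_UNIFIED_ATTN"  # assumed backend name
        and cache_config.block_size is None
    ):
        cache_config.block_size = 64  # measured to perform best with this backend
```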
…or terratorch to follow upstream
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This PR fixes tests for the following labels:
- Pytorch Nightly Dependency Override Check
- Python-only Installation Test

It also adds the torchaudio package to the Dockerfile.