-
Notifications
You must be signed in to change notification settings - Fork 6.6k
[core][bug] Fix infinite loop on E2BIG error when spawning a process #54838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @mjacar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical issue in Ray where worker processes could fail to launch due to system limitations on command-line argument length (E2BIG error), often caused by large runtime environments. The changes introduce a robust mechanism to detect this specific error during process startup, propagate it through the system, and provide clear, user-friendly error messages, ensuring that users are informed and tasks are properly handled when such failures occur.
Highlights
- Robust E2BIG Error Handling: Implemented a comprehensive solution to detect and handle the
E2BIG
(Argument list too long) error that can occur when launching worker processes, especially with large Ray runtime environments. This ensures that such failures are properly identified and reported. - Improved Process Launch Error Reporting: Refactored the underlying process spawning utility (
src/ray/util/process.cc
) to establish a reliable pipe-based communication mechanism. This allows the parent process to receive specificexecvpe
error codes (likeE2BIG
) from the child process, enabling precise error identification. - Enhanced User Feedback and Status Propagation: Introduced a new
PopWorkerStatus::ArgumentListTooLong
enum value and a dedicated error message. This message is propagated through the worker pool's callback system to provide clear, actionable feedback to users when worker startup fails due to excessively long command-line arguments. - Consistent Task Cancellation: Updated the
LocalTaskManager
to cancel tasks that encounter theArgumentListTooLong
status, aligning its behavior with existing error handling for runtime environment creation failures and ensuring proper resource cleanup.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a fix for the E2BIG
error that can occur when starting worker processes with large runtime environments. The approach involves a significant and robust refactoring of the process creation logic in src/ray/util/process.cc
to reliably capture errors from execvpe
, and then propagating this specific error up to the user.
Overall, the changes are well-structured and address the problem effectively. I've provided some feedback on logging levels, code style for consistency, and a critical issue regarding error handling in the new process creation logic that could lead to a race condition. Addressing these points will improve the robustness and maintainability of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix an E2BIG
error when starting worker processes by propagating the error from execvpe
. The changes in worker_pool
and local_task_manager
to handle the new ArgumentListTooLong
status are well-implemented.
However, the modifications in src/ray/util/process.cc
introduce a critical regression in how decoupled processes are handled. The new implementation breaks the mechanism for obtaining the correct process ID for decoupled (grandchild) processes, leading to incorrect process management. I've left a critical comment detailing this issue.
Additionally, there are a few minor formatting inconsistencies (tabs instead of spaces) that should be addressed for code style consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a fix for the E2BIG
error when starting worker processes, which is a valuable improvement. The approach involves adding a new status ArgumentListTooLong
and refactoring the process creation logic to correctly capture execvpe
errors.
The changes are well-structured, but I've identified a critical regression in src/ray/util/process.cc
where the pipe_to_stdin
functionality is broken due to a missing dup2
call. I've also provided a minor stylistic suggestion for src/ray/raylet/worker_pool.cc
to improve code conciseness. Please address the critical issue to ensure no existing functionality is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix an E2BIG
(Argument list too long) error that can occur when starting worker processes, particularly with large runtime environments. The core of the change is to correctly propagate the execvpe
error from the child process back to the raylet, allowing it to be handled gracefully. This is achieved by adding a new ArgumentListTooLong
status, updating function signatures to carry error information, and refactoring the process creation logic in process.cc
for more robust error handling.
The changes are logical and address the problem effectively. The refactoring of the process creation logic is a good improvement. I've identified one critical issue that would lead to a crash, and a couple of medium-severity issues related to code clarity and style that should be addressed, as well as a high severity issue related to error handling. Once these points are addressed, the PR should be in good shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix an E2BIG
(Argument list too long) error that can occur when starting worker processes, typically with large runtime environments. The core of the change is a significant refactoring of the process creation logic in src/ray/util/process.cc
to correctly propagate the errno
from execvpe
in the child process back to the parent.
The changes introduce a new PopWorkerStatus::ArgumentListTooLong
and update the worker pool and local task manager to handle this new status, which is a solid approach. The new process creation logic for non-decoupled processes (like workers) is robust and effectively solves the stated problem.
However, the refactoring appears to have introduced a regression for decoupled processes, where the specific errno
from a failed execvpe
is lost and replaced with a generic ECHILD
. I've also identified a minor potential issue with handling partial reads from the communication pipe. My feedback below details these points.
Overall, the changes are well-targeted to solve the E2BIG
issue for workers. Addressing the feedback will improve the robustness of process creation across the board.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses the E2BIG
error, enhancing the robustness of worker process creation, particularly when dealing with large runtime environments. The changes include adding a new PopWorkerStatus::ArgumentListTooLong
, propagating the error, and modifying process creation logic. The refactoring in src/ray/util/process.cc
improves error handling during process creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses the E2BIG
error during worker startup by introducing ArgumentListTooLong
status and refactoring process creation in process.cc
. The changes in process.cc
are good, but worker_pool.cc
has a critical flaw that needs to be addressed. The logic for handling the error code is incorrect, leading to compilation errors and incorrect behavior. I've provided a detailed comment and a suggested fix for this. Also, consider logging the error message in local_task_manager.cc
and using std::move
in worker_pool.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a fix for the E2BIG
error that can occur when starting worker processes with large runtime environments. The core change is in src/ray/util/process.cc
, enhancing process creation to propagate execvpe
errors reliably. This allows WorkerPool
to detect and handle the E2BIG
error gracefully. Additionally, there are changes in src/ray/raylet/local_task_manager.cc
and src/ray/raylet/worker_pool.cc
to handle the new ArgumentListTooLong
status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix an E2BIG
(argument list too long) error that can occur when starting a new worker process, typically due to a large runtime environment. The changes are comprehensive, starting from detecting the error at the process creation level in process.cc
and propagating it up to the worker_pool
and local_task_manager
. This allows Ray to fail the corresponding task with a clear and specific error message, which is a great improvement for user experience.
The refactoring in src/ray/util/process.cc
to reliably capture the errno
from execvpe
is a significant improvement in correctness. The new logic correctly handles both coupled and decoupled processes.
I've found one critical issue that will cause a compilation failure and another medium-severity issue related to the robustness of a read
call.
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
b399d15
to
7fa2d59
Compare
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
just use serve and core; stop pretending that we have separate packages. removes all unused python compile rules Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
## Why are these changes needed? When deduping schemas, emtpy schemas are not handled correctly. Works for pyarrow, but not for Panda because we use a NamedTuple. The current check uses ``` not schema ``` which handles schemas that are `None` and or length 0(an empty pyarrow schema has length 0 when empty). This is not the case for NamedTuples, So empty Panda blocks are not handled correctly which can cause weird behavior. empty schemas are 1. None 2. Pyarrow: not schema 3. Panda: not schema.names --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
## Why are these changes needed? - We want be able to control read lance table on fragment level. Current `ray.data.read_lance` does not support it and read all fragments even if `scanner_options` has `fragments` When you pass "fragments" to scanner_options you can control which fragments ray will read. --------- Signed-off-by: dshepelev15 <d-shepelev@list.ru> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…t#54832) Ray will errantly cancel the `LongPollClient.listen_for_change` task if the first task to a replica is canceled prior to `listen_for_change` returning. This PR works around the issue by initiating the `listen_for_change` from a background asyncio task with an empty `contextvars.Context`. Closes ray-project#52476 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…ect#54751) Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Jan Hilberath <jan@hilberath.de> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…ray-project#54786) ## Why are these changes needed? Since ray 2.48.0 has been released, the autoscaler v2 in the `latest` ray image should have cluster launcher support, and thus we can test autoscaler v2 on the cluster launcher in release tests. This PR adds those tests. <img width="1684" height="637" alt="image" src="https://github.com/user-attachments/assets/b9032008-fd8b-41ef-839e-10be0692779c" /> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [x] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Rueian <rueian@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
b37318f
to
d09aed0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a mechanism to handle E2BIG
errors during worker process startup, which is a great improvement for robustness, especially with large runtime environments. The core of the change is a significant and well-thought-out refactoring of the process creation logic in src/ray/util/process.cc
to reliably detect execvpe
failures. The new logic is more complex but provides much-needed error propagation.
The changes in worker_pool
and local_task_manager
correctly utilize this new error information to provide better feedback to the user.
I've found one high-severity correctness issue in the new process creation logic that could lead to undefined behavior in a specific failure scenario. I've provided a detailed comment and a suggested fix for it. Once that is addressed, this PR will be in excellent shape.
Summary of ChangesThis pull request aims to address the Highlights
Changelog
Activity
|
@edoakes @MortalHappiness You guys seem to be the last people to touch |
@mjacar thanks for the very detailed write-up and the contribution!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall, thanks for the contribution.
Could you try adding a Python integration test for this? It can look similar to the reproduction script you provided. I'd suggest adding it to
put the test in `test_runtime_env_standalone.py`. |
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
7f19859
to
131b55a
Compare
@edoakes Thanks for taking a look. I added a test in |
Why are these changes needed?
Summary
This resolves the issue described in #47432 where the raylet process will keep unsuccessfully trying to spawn a process if it hits a
E2BIG
error. This manifests as jobs stuck in aPending
state forever with no clear logging as to what is happening. This happens when the configuredruntime_env
is too big. This PR makes it so that creating a job with aruntime_env
that is too big fails immediately with an error message ofE2BIG error occurred when starting worker process. Worker command arguments likely too long.
inraylet.out
.Reproduction Script
Detailed Write-Up
Unfortunately, I ran into #47432 in the wild and went through the whole process of debugging it before I saw that that issue existed. :(
As stated, this shows up as a job stuck in the
Pending
state forever. As best as I can tell, the only real telemetry or logging that shows anything happening at all in this case is theruntime_env_agent.log
showing the runtime env continually get created and destroyed in an infinite loop. As in the description in #47432, I needed tostrace
the Raylet process and saw that it was trying toexecve
in a loop and was gettingE2BIG (Argument list too long)
error every single time.Therefore, the fundamental idea of this PR is to bubble up this error all the way from the
StartProcess
call inworker_pool.cc
to thePoppedWorkerHandler
inlocal_task_manager.cc
so that the local task manager can detect this error and not retry. This part of the change was reasonably straightforward.Unfortunately, this revealed another bug with how error codes are propagated between parent processes and child processes in the
spawnvpe
function inprocess.cc
.The
process.cc
Bug: A Startup Race ConditionWhen creating a new "decoupled" worker process on Unix-like systems, Ray uses a standard "double fork" technique. The raylet forks a child, which immediately forks a grandchild (the actual worker) and then exits. This orphans the grandchild, which is then adopted by the system's
init
process. This ensures the worker can outlive its immediate parent.The bug was in how the raylet (the original parent) confirmed that the grandchild worker started successfully. The old implementation worked like this:
execve
to become the actual worker program.The raylet assumed the launch was successful as soon as it received the PID. However, this confirmation happened before the grandchild actually tried to run
execve
. If theexecve
call failed (as it does withE2BIG
), the grandchild would simply exit, but the raylet would never know about this failure. Concretely, this meant that the error code associated with theStartProcess
call inworker_pool.cc
would be0
despite it actually failing.The Fix: A Robust Communication Protocol
This PR fixes the race condition by introducing a much more robust communication protocol using a dedicated "status pipe" and the
close-on-exec
flag.FD_CLOEXEC
(close-on-exec
) flag on its end of this pipe. This is the crucial part of the fix:execve
succeeds, the operating system automatically closes the child's end of the pipe. The parent sees this as an "end-of-file" (EOF) and knows with certainty that the launch was successful.execve
fails, the code after theexecve
call runs. This code now explicitly writes the systemerrno
(e.g.,E2BIG
) into the pipe before the child exits. The parent reads this error code and knows the exact reason for the launch failure.With this fix in
process.cc
,execve
failures are now correctly propagated back to theStartProcess
call. The correct error code is now set. TheE2BIG
error is no longer lost and is correctly bubbled up to theLocalTaskManager
, which can now fail the task submission and prevent the infinite retry loop.The user now gets an immediate and clear error message instead of a mysteriously pending job.
Related issue number
Closes #47432
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.