Skip to content

[core][bug] Fix infinite loop on E2BIG error when spawning a process #54838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 41 commits into
base: master
Choose a base branch
from

Conversation

mjacar
Copy link

@mjacar mjacar commented Jul 22, 2025

Why are these changes needed?

Summary

This resolves the issue described in #47432 where the raylet process will keep unsuccessfully trying to spawn a process if it hits a E2BIG error. This manifests as jobs stuck in a Pending state forever with no clear logging as to what is happening. This happens when the configured runtime_env is too big. This PR makes it so that creating a job with a runtime_env that is too big fails immediately with an error message of E2BIG error occurred when starting worker process. Worker command arguments likely too long. in raylet.out.

Reproduction Script

import ray
import os
import time

print("Starting Ray to reproduce E2BIG error...")

# 1. Define a massive dictionary of environment variables.
large_env_vars = {
    "MY_HUGE_VAR": "X" * 4096 * 100
}

# 2. Initialize Ray with this huge runtime_env.
# This doesn't trigger the error yet. The runtime_env is just registered.
ray.init(
    runtime_env={"env_vars": large_env_vars}
)

print("Ray initialized. Defining and calling a remote task...")

# 3. Define a simple remote function.
# When we call this function, Ray will try to start a new worker.
@ray.remote
def my_task():
    # We can check if one of the vars is present to confirm success.
    return f"Successfully accessed env var: {os.environ.get('MY_HUGE_VAR', 'NOT_FOUND')}"

# 4. Trigger the worker creation process.
# This is the step that will fail because the execve call for the new
# worker process will have an argument list that is too long.
try:
    print("Calling remote task. This may hang or fail if the error is reproduced.")
    future = my_task.remote()
    result = ray.get(future, timeout=None)
    print(f"SUCCESS! Task completed with result: {result}")
except Exception as e:
    print(f"\nCaught an exception, likely due to worker startup failure: {e}")

print("Script finished. Check the raylet logs or use strace to confirm the E2BIG error.")
ray.shutdown()

Detailed Write-Up

Unfortunately, I ran into #47432 in the wild and went through the whole process of debugging it before I saw that that issue existed. :(

As stated, this shows up as a job stuck in the Pending state forever. As best as I can tell, the only real telemetry or logging that shows anything happening at all in this case is the runtime_env_agent.log showing the runtime env continually get created and destroyed in an infinite loop. As in the description in #47432, I needed to strace the Raylet process and saw that it was trying to execve in a loop and was getting E2BIG (Argument list too long) error every single time.

Therefore, the fundamental idea of this PR is to bubble up this error all the way from the StartProcess call in worker_pool.cc to the PoppedWorkerHandler in local_task_manager.cc so that the local task manager can detect this error and not retry. This part of the change was reasonably straightforward.

Unfortunately, this revealed another bug with how error codes are propagated between parent processes and child processes in the spawnvpe function in process.cc.

The process.cc Bug: A Startup Race Condition

When creating a new "decoupled" worker process on Unix-like systems, Ray uses a standard "double fork" technique. The raylet forks a child, which immediately forks a grandchild (the actual worker) and then exits. This orphans the grandchild, which is then adopted by the system's init process. This ensures the worker can outlive its immediate parent.

The bug was in how the raylet (the original parent) confirmed that the grandchild worker started successfully. The old implementation worked like this:

  1. Raylet creates a pipe.
  2. The grandchild process writes its Process ID (PID) back to the raylet through the pipe.
  3. The grandchild then calls execve to become the actual worker program.

The raylet assumed the launch was successful as soon as it received the PID. However, this confirmation happened before the grandchild actually tried to run execve. If the execve call failed (as it does with E2BIG), the grandchild would simply exit, but the raylet would never know about this failure. Concretely, this meant that the error code associated with the StartProcess call in worker_pool.cc would be 0 despite it actually failing.

The Fix: A Robust Communication Protocol

This PR fixes the race condition by introducing a much more robust communication protocol using a dedicated "status pipe" and the close-on-exec flag.

  1. A "status pipe" is created solely for reporting the status of the launch.
  2. The child process sets the FD_CLOEXEC (close-on-exec) flag on its end of this pipe. This is the crucial part of the fix:
    • If execve succeeds, the operating system automatically closes the child's end of the pipe. The parent sees this as an "end-of-file" (EOF) and knows with certainty that the launch was successful.
    • If execve fails, the code after the execve call runs. This code now explicitly writes the system errno (e.g., E2BIG) into the pipe before the child exits. The parent reads this error code and knows the exact reason for the launch failure.

With this fix in process.cc, execve failures are now correctly propagated back to the StartProcess call. The correct error code is now set. The E2BIG error is no longer lost and is correctly bubbled up to the LocalTaskManager, which can now fail the task submission and prevent the infinite retry loop.

The user now gets an immediate and clear error message instead of a mysteriously pending job.

Related issue number

Closes #47432

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Existing tests
    • The reproduction script ceases to reproduce the error
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @mjacar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in Ray where worker processes could fail to launch due to system limitations on command-line argument length (E2BIG error), often caused by large runtime environments. The changes introduce a robust mechanism to detect this specific error during process startup, propagate it through the system, and provide clear, user-friendly error messages, ensuring that users are informed and tasks are properly handled when such failures occur.

Highlights

  • Robust E2BIG Error Handling: Implemented a comprehensive solution to detect and handle the E2BIG (Argument list too long) error that can occur when launching worker processes, especially with large Ray runtime environments. This ensures that such failures are properly identified and reported.
  • Improved Process Launch Error Reporting: Refactored the underlying process spawning utility (src/ray/util/process.cc) to establish a reliable pipe-based communication mechanism. This allows the parent process to receive specific execvpe error codes (like E2BIG) from the child process, enabling precise error identification.
  • Enhanced User Feedback and Status Propagation: Introduced a new PopWorkerStatus::ArgumentListTooLong enum value and a dedicated error message. This message is propagated through the worker pool's callback system to provide clear, actionable feedback to users when worker startup fails due to excessively long command-line arguments.
  • Consistent Task Cancellation: Updated the LocalTaskManager to cancel tasks that encounter the ArgumentListTooLong status, aligning its behavior with existing error handling for runtime environment creation failures and ensuring proper resource cleanup.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for the E2BIG error that can occur when starting worker processes with large runtime environments. The approach involves a significant and robust refactoring of the process creation logic in src/ray/util/process.cc to reliably capture errors from execvpe, and then propagating this specific error up to the user.

Overall, the changes are well-structured and address the problem effectively. I've provided some feedback on logging levels, code style for consistency, and a critical issue regarding error handling in the new process creation logic that could lead to a race condition. Addressing these points will improve the robustness and maintainability of the code.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix an E2BIG error when starting worker processes by propagating the error from execvpe. The changes in worker_pool and local_task_manager to handle the new ArgumentListTooLong status are well-implemented.

However, the modifications in src/ray/util/process.cc introduce a critical regression in how decoupled processes are handled. The new implementation breaks the mechanism for obtaining the correct process ID for decoupled (grandchild) processes, leading to incorrect process management. I've left a critical comment detailing this issue.

Additionally, there are a few minor formatting inconsistencies (tabs instead of spaces) that should be addressed for code style consistency.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for the E2BIG error when starting worker processes, which is a valuable improvement. The approach involves adding a new status ArgumentListTooLong and refactoring the process creation logic to correctly capture execvpe errors.

The changes are well-structured, but I've identified a critical regression in src/ray/util/process.cc where the pipe_to_stdin functionality is broken due to a missing dup2 call. I've also provided a minor stylistic suggestion for src/ray/raylet/worker_pool.cc to improve code conciseness. Please address the critical issue to ensure no existing functionality is broken.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix an E2BIG (Argument list too long) error that can occur when starting worker processes, particularly with large runtime environments. The core of the change is to correctly propagate the execvpe error from the child process back to the raylet, allowing it to be handled gracefully. This is achieved by adding a new ArgumentListTooLong status, updating function signatures to carry error information, and refactoring the process creation logic in process.cc for more robust error handling.

The changes are logical and address the problem effectively. The refactoring of the process creation logic is a good improvement. I've identified one critical issue that would lead to a crash, and a couple of medium-severity issues related to code clarity and style that should be addressed, as well as a high severity issue related to error handling. Once these points are addressed, the PR should be in good shape.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix an E2BIG (Argument list too long) error that can occur when starting worker processes, typically with large runtime environments. The core of the change is a significant refactoring of the process creation logic in src/ray/util/process.cc to correctly propagate the errno from execvpe in the child process back to the parent.

The changes introduce a new PopWorkerStatus::ArgumentListTooLong and update the worker pool and local task manager to handle this new status, which is a solid approach. The new process creation logic for non-decoupled processes (like workers) is robust and effectively solves the stated problem.

However, the refactoring appears to have introduced a regression for decoupled processes, where the specific errno from a failed execvpe is lost and replaced with a generic ECHILD. I've also identified a minor potential issue with handling partial reads from the communication pipe. My feedback below details these points.

Overall, the changes are well-targeted to solve the E2BIG issue for workers. Addressing the feedback will improve the robustness of process creation across the board.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses the E2BIG error, enhancing the robustness of worker process creation, particularly when dealing with large runtime environments. The changes include adding a new PopWorkerStatus::ArgumentListTooLong, propagating the error, and modifying process creation logic. The refactoring in src/ray/util/process.cc improves error handling during process creation.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses the E2BIG error during worker startup by introducing ArgumentListTooLong status and refactoring process creation in process.cc. The changes in process.cc are good, but worker_pool.cc has a critical flaw that needs to be addressed. The logic for handling the error code is incorrect, leading to compilation errors and incorrect behavior. I've provided a detailed comment and a suggested fix for this. Also, consider logging the error message in local_task_manager.cc and using std::move in worker_pool.cc.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for the E2BIG error that can occur when starting worker processes with large runtime environments. The core change is in src/ray/util/process.cc, enhancing process creation to propagate execvpe errors reliably. This allows WorkerPool to detect and handle the E2BIG error gracefully. Additionally, there are changes in src/ray/raylet/local_task_manager.cc and src/ray/raylet/worker_pool.cc to handle the new ArgumentListTooLong status.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix an E2BIG (argument list too long) error that can occur when starting a new worker process, typically due to a large runtime environment. The changes are comprehensive, starting from detecting the error at the process creation level in process.cc and propagating it up to the worker_pool and local_task_manager. This allows Ray to fail the corresponding task with a clear and specific error message, which is a great improvement for user experience.

The refactoring in src/ray/util/process.cc to reliably capture the errno from execvpe is a significant improvement in correctness. The new logic correctly handles both coupled and decoupled processes.

I've found one critical issue that will cause a compilation failure and another medium-severity issue related to the robustness of a read call.

Copy link
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@mjacar mjacar force-pushed the fix-e2big-problem branch 2 times, most recently from b399d15 to 7fa2d59 Compare July 23, 2025 03:00
mjacar and others added 17 commits July 23, 2025 05:59
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
just use serve and core; stop pretending that we have separate packages.

removes all unused python compile rules

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
)

Experimental feature we're in the process of deprecating; skip the flaky
test.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
## Why are these changes needed?
When deduping schemas, emtpy schemas are not handled correctly. Works
for pyarrow, but not for Panda because we use a NamedTuple. The current
check uses
```
not schema
```
which handles schemas that are `None` and or length 0(an empty pyarrow
schema has length 0 when empty). This is not the case for NamedTuples,
So empty Panda blocks are not handled correctly which can cause weird
behavior.

empty schemas are
1. None
2. Pyarrow: not schema
3. Panda: not schema.names

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
## Why are these changes needed?
- We want be able to control read lance table on fragment level. Current
`ray.data.read_lance` does not support it and read all fragments even if
`scanner_options` has `fragments`

When you pass "fragments" to scanner_options you can control which
fragments ray will read.

---------

Signed-off-by: dshepelev15 <d-shepelev@list.ru>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…t#54832)

Ray will errantly cancel the `LongPollClient.listen_for_change` task if
the first task to a replica is canceled prior to `listen_for_change`
returning.

This PR works around the issue by initiating the `listen_for_change`
from a background asyncio task with an empty `contextvars.Context`.

Closes ray-project#52476

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
mjacar and others added 11 commits July 23, 2025 05:59
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…ect#54751)

Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Jan Hilberath <jan@hilberath.de>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…ray-project#54786)

## Why are these changes needed?

Since ray 2.48.0 has been released, the autoscaler v2 in the `latest`
ray image should have cluster launcher support, and thus we can test
autoscaler v2 on the cluster launcher in release tests.

This PR adds those tests.

<img width="1684" height="637" alt="image"
src="https://github.com/user-attachments/assets/b9032008-fd8b-41ef-839e-10be0692779c"
/>

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [x] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Rueian <rueian@anyscale.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
@mjacar mjacar force-pushed the fix-e2big-problem branch from b37318f to d09aed0 Compare July 23, 2025 05:59
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to handle E2BIG errors during worker process startup, which is a great improvement for robustness, especially with large runtime environments. The core of the change is a significant and well-thought-out refactoring of the process creation logic in src/ray/util/process.cc to reliably detect execvpe failures. The new logic is more complex but provides much-needed error propagation.

The changes in worker_pool and local_task_manager correctly utilize this new error information to provide better feedback to the user.

I've found one high-severity correctness issue in the new process creation logic that could lead to undefined behavior in a specific failure scenario. I've provided a detailed comment and a suggested fix for it. Once that is addressed, this PR will be in excellent shape.

Copy link
Contributor

Summary of Changes

This pull request aims to address the E2BIG error that can occur when starting worker processes, particularly when large runtime environments lead to excessively long command-line arguments. It introduces a new PopWorkerStatus to specifically handle this error and significantly improves the error propagation and process management mechanisms for worker startup failures, providing more detailed error messages to the user.

Highlights

  • Error Handling for E2BIG: Introduced PopWorkerStatus::ArgumentListTooLong to specifically identify and handle cases where worker processes fail to start due to command-line arguments being too long (E2BIG error), providing a more granular failure status.
  • Enhanced Worker Startup Error Propagation: Modified the worker pool's callback mechanisms to pass a detailed runtime_env_setup_error_message when worker startup fails, allowing for more informative error reporting to the user about the cause of the failure.
  • Refactored Process Spawning Logic: The underlying process spawning utility (ProcessFD::Spawn) has been significantly refactored to robustly capture and propagate execvpe errors from child processes back to the parent, improving reliability and debuggability of worker startup failures.
  • Improved Parent-Child Communication: Implemented a new pipe-based mechanism for the parent process to receive the child's PID and any execvpe error codes, addressing previous issues with ignored return values and potential zombie processes, especially for decoupled worker processes.
Changelog
  • src/ray/raylet/local_task_manager.cc
    • Tasks are now cancelled for ArgumentListTooLong status in addition to RuntimeEnvCreationFailed.
    • Updated comments to reflect handling of general worker startup failures.
  • src/ray/raylet/worker_pool.cc
    • PopWorkerCallbackAsync and PopWorkerCallbackInternal now accept an error message string.
    • StartWorkerProcess now explicitly handles E2BIG errors and other process startup failures, setting appropriate PopWorkerStatus and error messages.
    • StartProcess signature updated to pass std::error_code by reference.
    • Specific error message provided for ArgumentListTooLong status when calling PopWorkerCallbackAsync.
  • src/ray/raylet/worker_pool.h
    • Added ArgumentListTooLong enum value to PopWorkerStatus with a description.
    • Updated function signatures for StartProcess, PopWorkerCallbackInternal, and PopWorkerCallbackAsync to include new error handling parameters.
  • src/ray/raylet/worker_pool_test.cc
    • Mock functions updated to match new parameter lists for PopWorkerCallbackAsync and StartProcess.
  • src/ray/util/process.cc
    • Introduced ReadBytesFromFd helper for robust file descriptor reading.
    • Major refactoring of ProcessFD::Spawn to use a status_pipe for child process PID and error communication.
    • Improved error handling for fork and pipe system calls.
    • Revised logic for decoupled processes to correctly retrieve grandchild PIDs and propagate execvpe errors.
Activity
  • The pull request author, mjacar, initiated this "Work In Progress" PR to fix an E2BIG problem.
  • gemini-code-assist[bot] has provided extensive automated review feedback, identifying numerous issues.
  • Critical Issues Identified:
    • Regression in spawnvpe for decoupled processes, leading to incorrect PID, broken lifetime tracking, and race conditions in error reporting (Comment 2223710079).
    • Regression in pipe_to_stdin logic in spawnvpe (missing dup2 call) (Comment 2223762118).
    • Incorrect RAY_CHECK in PopWorkerCallbackAsync that would cause crashes (Comment 2223771630).
    • Initial implementation of StartProcess had ec checked before assignment, missing call, and extraneous brace (Comment 2223839327).
    • Potential resource leak if pipe fails in ProcessFD::Spawn (Comment 2223830064).
    • Potential correctness issue with partial reads for err_from_child in non-decoupled processes, leading to undefined behavior (Comment 2225689234).
  • High Priority Issues Identified:
    • Loss of original errno for decoupled processes, reporting ECHILD instead of the actual error (Comment 2223784017).
    • Ignored write return value in child process when reporting execvpe error (Comment 2223640494).
    • Suggested logging error and returning failure status instead of FATAL for other errors in StartWorkerProcess (Comment 2223852325).
  • Medium Priority Issues Identified:
    • Numerous indentation inconsistencies (tabs vs. spaces) across multiple files (worker_pool.cc, worker_pool.h).
    • Suggestions for more robust read checks, simplifying variable initialization, and using std::move.
    • Recommendations for adding more specific log messages and updating comments for clarity.
    • Concerns about ignored waitpid return values and potential zombie processes, with detailed explanations provided by the bot (Comment 2223640529).
  • mjacar has been actively engaging with the bot's feedback, asking for specific code suggestions and clarifications on the identified issues.
  • The PR is still marked as "WIP" and has several critical and high-priority issues that need to be addressed.

@mjacar mjacar marked this pull request as ready for review July 23, 2025 17:06
@mjacar mjacar requested a review from a team as a code owner July 23, 2025 17:06
@mjacar mjacar changed the title WIP: Fix E2BIG problem [core][bug] Fix infinite loop on E2BIG error when spawning a process Jul 23, 2025
@mjacar
Copy link
Author

mjacar commented Jul 23, 2025

@edoakes @MortalHappiness You guys seem to be the last people to touch process.cc so maybe you would at least know who to forward this to for code review.

@ray-gardener ray-gardener bot added community-contribution Contributed by the community core Issues that should be addressed in Ray Core labels Jul 23, 2025
@jjyao
Copy link
Collaborator

jjyao commented Jul 23, 2025

@mjacar thanks for the very detailed write-up and the contribution!!

Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall, thanks for the contribution.

Could you try adding a Python integration test for this? It can look similar to the reproduction script you provided. I'd suggest adding it to

put the test in `test_runtime_env_standalone.py`.

mjacar and others added 2 commits July 23, 2025 18:06
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Michael Acar <michaeljacar@protonmail.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
@mjacar mjacar force-pushed the fix-e2big-problem branch from 7f19859 to 131b55a Compare July 23, 2025 23:20
@mjacar
Copy link
Author

mjacar commented Jul 23, 2025

@edoakes Thanks for taking a look. I added a test in test_runtime_env.py like you asked.

mjacar and others added 2 commits July 24, 2025 03:56
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community core Issues that should be addressed in Ray Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Long runtime env hangs the program silently
9 participants