[core][bug] Fix infinite loop on E2BIG error when spawning a process #54838

mjacar · 2025-07-22T19:25:33Z

Why are these changes needed?

Summary

This resolves the issue described in #47432 where the raylet process will keep unsuccessfully trying to spawn a process if it hits a E2BIG error. This manifests as jobs stuck in a Pending state forever with no clear logging as to what is happening. This happens when the configured runtime_env is too big. This PR makes it so that creating a job with a runtime_env that is too big fails immediately with an error message of E2BIG error occurred when starting worker process. Worker command arguments likely too long. in raylet.out.

Reproduction Script

import ray
import os
import time

print("Starting Ray to reproduce E2BIG error...")

# 1. Define a massive dictionary of environment variables.
large_env_vars = {
    "MY_HUGE_VAR": "X" * 4096 * 100
}

# 2. Initialize Ray with this huge runtime_env.
# This doesn't trigger the error yet. The runtime_env is just registered.
ray.init(
    runtime_env={"env_vars": large_env_vars}
)

print("Ray initialized. Defining and calling a remote task...")

# 3. Define a simple remote function.
# When we call this function, Ray will try to start a new worker.
@ray.remote
def my_task():
    # We can check if one of the vars is present to confirm success.
    return f"Successfully accessed env var: {os.environ.get('MY_HUGE_VAR', 'NOT_FOUND')}"

# 4. Trigger the worker creation process.
# This is the step that will fail because the execve call for the new
# worker process will have an argument list that is too long.
try:
    print("Calling remote task. This may hang or fail if the error is reproduced.")
    future = my_task.remote()
    result = ray.get(future, timeout=None)
    print(f"SUCCESS! Task completed with result: {result}")
except Exception as e:
    print(f"\nCaught an exception, likely due to worker startup failure: {e}")

print("Script finished. Check the raylet logs or use strace to confirm the E2BIG error.")
ray.shutdown()

Detailed Write-Up

Unfortunately, I ran into #47432 in the wild and went through the whole process of debugging it before I saw that that issue existed. :(

As stated, this shows up as a job stuck in the Pending state forever. As best as I can tell, the only real telemetry or logging that shows anything happening at all in this case is the runtime_env_agent.log showing the runtime env continually get created and destroyed in an infinite loop. As in the description in #47432, I needed to strace the Raylet process and saw that it was trying to execve in a loop and was getting E2BIG (Argument list too long) error every single time.

Therefore, the fundamental idea of this PR is to bubble up this error all the way from the StartProcess call in worker_pool.cc to the PoppedWorkerHandler in local_task_manager.cc so that the local task manager can detect this error and not retry. This part of the change was reasonably straightforward.

Unfortunately, this revealed another bug with how error codes are propagated between parent processes and child processes in the spawnvpe function in process.cc.

The `process.cc` Bug: A Startup Race Condition

When creating a new "decoupled" worker process on Unix-like systems, Ray uses a standard "double fork" technique. The raylet forks a child, which immediately forks a grandchild (the actual worker) and then exits. This orphans the grandchild, which is then adopted by the system's init process. This ensures the worker can outlive its immediate parent.

The bug was in how the raylet (the original parent) confirmed that the grandchild worker started successfully. The old implementation worked like this:

Raylet creates a pipe.
The grandchild process writes its Process ID (PID) back to the raylet through the pipe.
The grandchild then calls execve to become the actual worker program.

The raylet assumed the launch was successful as soon as it received the PID. However, this confirmation happened before the grandchild actually tried to run execve. If the execve call failed (as it does with E2BIG), the grandchild would simply exit, but the raylet would never know about this failure. Concretely, this meant that the error code associated with the StartProcess call in worker_pool.cc would be 0 despite it actually failing.

The Fix: A Robust Communication Protocol

This PR fixes the race condition by introducing a much more robust communication protocol using a dedicated "status pipe" and the close-on-exec flag.

A "status pipe" is created solely for reporting the status of the launch.
The child process sets the FD_CLOEXEC (close-on-exec) flag on its end of this pipe. This is the crucial part of the fix:
- If execve succeeds, the operating system automatically closes the child's end of the pipe. The parent sees this as an "end-of-file" (EOF) and knows with certainty that the launch was successful.
- If execve fails, the code after the execve call runs. This code now explicitly writes the system errno (e.g., E2BIG) into the pipe before the child exits. The parent reads this error code and knows the exact reason for the launch failure.

With this fix in process.cc, execve failures are now correctly propagated back to the StartProcess call. The correct error code is now set. The E2BIG error is no longer lost and is correctly bubbled up to the LocalTaskManager, which can now fail the task submission and prevent the infinite retry loop.

The user now gets an immediate and clear error message instead of a mysteriously pending job.

Related issue number

Closes #47432

Checks

gemini-code-assist

Summary of Changes

Hello @mjacar, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in Ray where worker processes could fail to launch due to system limitations on command-line argument length (E2BIG error), often caused by large runtime environments. The changes introduce a robust mechanism to detect this specific error during process startup, propagate it through the system, and provide clear, user-friendly error messages, ensuring that users are informed and tasks are properly handled when such failures occur.

Highlights

Robust E2BIG Error Handling: Implemented a comprehensive solution to detect and handle the E2BIG (Argument list too long) error that can occur when launching worker processes, especially with large Ray runtime environments. This ensures that such failures are properly identified and reported.
Improved Process Launch Error Reporting: Refactored the underlying process spawning utility (src/ray/util/process.cc) to establish a reliable pipe-based communication mechanism. This allows the parent process to receive specific execvpe error codes (like E2BIG) from the child process, enabling precise error identification.
Enhanced User Feedback and Status Propagation: Introduced a new PopWorkerStatus::ArgumentListTooLong enum value and a dedicated error message. This message is propagated through the worker pool's callback system to provide clear, actionable feedback to users when worker startup fails due to excessively long command-line arguments.
Consistent Task Cancellation: Updated the LocalTaskManager to cancel tasks that encounter the ArgumentListTooLong status, aligning its behavior with existing error handling for runtime environment creation failures and ensuring proper resource cleanup.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a fix for the E2BIG error that can occur when starting worker processes with large runtime environments. The approach involves a significant and robust refactoring of the process creation logic in src/ray/util/process.cc to reliably capture errors from execvpe, and then propagating this specific error up to the user.

Overall, the changes are well-structured and address the problem effectively. I've provided some feedback on logging levels, code style for consistency, and a critical issue regarding error handling in the new process creation logic that could lead to a race condition. Addressing these points will improve the robustness and maintainability of the code.

src/ray/util/process.cc

src/ray/raylet/worker_pool.cc

src/ray/util/process.cc

gemini-code-assist

Code Review

This pull request aims to fix an E2BIG error when starting worker processes by propagating the error from execvpe. The changes in worker_pool and local_task_manager to handle the new ArgumentListTooLong status are well-implemented.

However, the modifications in src/ray/util/process.cc introduce a critical regression in how decoupled processes are handled. The new implementation breaks the mechanism for obtaining the correct process ID for decoupled (grandchild) processes, leading to incorrect process management. I've left a critical comment detailing this issue.

Additionally, there are a few minor formatting inconsistencies (tabs instead of spaces) that should be addressed for code style consistency.

src/ray/util/process.cc

src/ray/raylet/worker_pool.cc

src/ray/raylet/worker_pool.h

gemini-code-assist

Code Review

This pull request introduces a fix for the E2BIG error when starting worker processes, which is a valuable improvement. The approach involves adding a new status ArgumentListTooLong and refactoring the process creation logic to correctly capture execvpe errors.

The changes are well-structured, but I've identified a critical regression in src/ray/util/process.cc where the pipe_to_stdin functionality is broken due to a missing dup2 call. I've also provided a minor stylistic suggestion for src/ray/raylet/worker_pool.cc to improve code conciseness. Please address the critical issue to ensure no existing functionality is broken.

src/ray/util/process.cc

src/ray/raylet/worker_pool.cc

gemini-code-assist

Code Review

This pull request aims to fix an E2BIG (Argument list too long) error that can occur when starting worker processes, particularly with large runtime environments. The core of the change is to correctly propagate the execvpe error from the child process back to the raylet, allowing it to be handled gracefully. This is achieved by adding a new ArgumentListTooLong status, updating function signatures to carry error information, and refactoring the process creation logic in process.cc for more robust error handling.

The changes are logical and address the problem effectively. The refactoring of the process creation logic is a good improvement. I've identified one critical issue that would lead to a crash, and a couple of medium-severity issues related to code clarity and style that should be addressed, as well as a high severity issue related to error handling. Once these points are addressed, the PR should be in good shape.

src/ray/raylet/worker_pool.cc

src/ray/raylet/local_task_manager.cc

src/ray/raylet/worker_pool.cc

gemini-code-assist

Code Review

This pull request aims to fix an E2BIG (Argument list too long) error that can occur when starting worker processes, typically with large runtime environments. The core of the change is a significant refactoring of the process creation logic in src/ray/util/process.cc to correctly propagate the errno from execvpe in the child process back to the parent.

The changes introduce a new PopWorkerStatus::ArgumentListTooLong and update the worker pool and local task manager to handle this new status, which is a solid approach. The new process creation logic for non-decoupled processes (like workers) is robust and effectively solves the stated problem.

However, the refactoring appears to have introduced a regression for decoupled processes, where the specific errno from a failed execvpe is lost and replaced with a generic ECHILD. I've also identified a minor potential issue with handling partial reads from the communication pipe. My feedback below details these points.

Overall, the changes are well-targeted to solve the E2BIG issue for workers. Addressing the feedback will improve the robustness of process creation across the board.

src/ray/util/process.cc

gemini-code-assist

Code Review

This pull request addresses the E2BIG error, enhancing the robustness of worker process creation, particularly when dealing with large runtime environments. The changes include adding a new PopWorkerStatus::ArgumentListTooLong, propagating the error, and modifying process creation logic. The refactoring in src/ray/util/process.cc improves error handling during process creation.

src/ray/util/process.cc

src/ray/raylet/local_task_manager.cc

src/ray/raylet/worker_pool.cc

gemini-code-assist

Code Review

This pull request addresses the E2BIG error during worker startup by introducing ArgumentListTooLong status and refactoring process creation in process.cc. The changes in process.cc are good, but worker_pool.cc has a critical flaw that needs to be addressed. The logic for handling the error code is incorrect, leading to compilation errors and incorrect behavior. I've provided a detailed comment and a suggested fix for this. Also, consider logging the error message in local_task_manager.cc and using std::move in worker_pool.cc.

src/ray/raylet/worker_pool.cc

src/ray/raylet/local_task_manager.cc

src/ray/raylet/worker_pool.cc

gemini-code-assist

Code Review

This pull request introduces a fix for the E2BIG error that can occur when starting worker processes with large runtime environments. The core change is in src/ray/util/process.cc, enhancing process creation to propagate execvpe errors reliably. This allows WorkerPool to detect and handle the E2BIG error gracefully. Additionally, there are changes in src/ray/raylet/local_task_manager.cc and src/ray/raylet/worker_pool.cc to handle the new ArgumentListTooLong status.

src/ray/raylet/worker_pool.cc

src/ray/util/process.cc

src/ray/raylet/local_task_manager.cc

gemini-code-assist

Code Review

This pull request aims to fix an E2BIG (argument list too long) error that can occur when starting a new worker process, typically due to a large runtime environment. The changes are comprehensive, starting from detecting the error at the process creation level in process.cc and propagating it up to the worker_pool and local_task_manager. This allows Ray to fail the corresponding task with a clear and specific error message, which is a great improvement for user experience.

The refactoring in src/ray/util/process.cc to reliably capture the errno from execvpe is a significant improvement in correctness. The new logic correctly handles both coupled and decoupled processes.

I've found one critical issue that will cause a compilation failure and another medium-severity issue related to the robustness of a read call.

src/ray/raylet/worker_pool.cc

src/ray/util/process.cc

gemini-code-assist · 2025-07-22T21:31:24Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

just use serve and core; stop pretending that we have separate packages. removes all unused python compile rules Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

) Experimental feature we're in the process of deprecating; skip the flaky test. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

## Why are these changes needed? When deduping schemas, emtpy schemas are not handled correctly. Works for pyarrow, but not for Panda because we use a NamedTuple. The current check uses ``` not schema ``` which handles schemas that are `None` and or length 0(an empty pyarrow schema has length 0 when empty). This is not the case for NamedTuples, So empty Panda blocks are not handled correctly which can cause weird behavior. empty schemas are 1. None 2. Pyarrow: not schema 3. Panda: not schema.names --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

## Why are these changes needed? - We want be able to control read lance table on fragment level. Current `ray.data.read_lance` does not support it and read all fragments even if `scanner_options` has `fragments` When you pass "fragments" to scanner_options you can control which fragments ray will read. --------- Signed-off-by: dshepelev15 <d-shepelev@list.ru> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

…t#54832) Ray will errantly cancel the `LongPollClient.listen_for_change` task if the first task to a replica is canceled prior to `listen_for_change` returning. This PR works around the issue by initiating the `listen_for_change` from a background asyncio task with an empty `contextvars.Context`. Closes ray-project#52476 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

…ect#54751) Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Signed-off-by: Jan Hilberath <jan@hilberath.de> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

…ray-project#54786) ## Why are these changes needed? Since ray 2.48.0 has been released, the autoscaler v2 in the `latest` ray image should have cluster launcher support, and thus we can test autoscaler v2 on the cluster launcher in release tests. This PR adds those tests. <img width="1684" height="637" alt="image" src="https://github.com/user-attachments/assets/b9032008-fd8b-41ef-839e-10be0692779c" /> ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [x] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Rueian <rueian@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a mechanism to handle E2BIG errors during worker process startup, which is a great improvement for robustness, especially with large runtime environments. The core of the change is a significant and well-thought-out refactoring of the process creation logic in src/ray/util/process.cc to reliably detect execvpe failures. The new logic is more complex but provides much-needed error propagation.

The changes in worker_pool and local_task_manager correctly utilize this new error information to provide better feedback to the user.

I've found one high-severity correctness issue in the new process creation logic that could lead to undefined behavior in a specific failure scenario. I've provided a detailed comment and a suggested fix for it. Once that is addressed, this PR will be in excellent shape.

src/ray/util/process.cc

src/ray/raylet/local_task_manager.cc

gemini-code-assist · 2025-07-23T14:08:10Z

Summary of Changes

This pull request aims to address the E2BIG error that can occur when starting worker processes, particularly when large runtime environments lead to excessively long command-line arguments. It introduces a new PopWorkerStatus to specifically handle this error and significantly improves the error propagation and process management mechanisms for worker startup failures, providing more detailed error messages to the user.

Highlights

Error Handling for E2BIG: Introduced PopWorkerStatus::ArgumentListTooLong to specifically identify and handle cases where worker processes fail to start due to command-line arguments being too long (E2BIG error), providing a more granular failure status.
Enhanced Worker Startup Error Propagation: Modified the worker pool's callback mechanisms to pass a detailed runtime_env_setup_error_message when worker startup fails, allowing for more informative error reporting to the user about the cause of the failure.
Refactored Process Spawning Logic: The underlying process spawning utility (ProcessFD::Spawn) has been significantly refactored to robustly capture and propagate execvpe errors from child processes back to the parent, improving reliability and debuggability of worker startup failures.
Improved Parent-Child Communication: Implemented a new pipe-based mechanism for the parent process to receive the child's PID and any execvpe error codes, addressing previous issues with ignored return values and potential zombie processes, especially for decoupled worker processes.

Changelog

src/ray/raylet/local_task_manager.cc
- Tasks are now cancelled for ArgumentListTooLong status in addition to RuntimeEnvCreationFailed.
- Updated comments to reflect handling of general worker startup failures.
src/ray/raylet/worker_pool.cc
- PopWorkerCallbackAsync and PopWorkerCallbackInternal now accept an error message string.
- StartWorkerProcess now explicitly handles E2BIG errors and other process startup failures, setting appropriate PopWorkerStatus and error messages.
- StartProcess signature updated to pass std::error_code by reference.
- Specific error message provided for ArgumentListTooLong status when calling PopWorkerCallbackAsync.
src/ray/raylet/worker_pool.h
- Added ArgumentListTooLong enum value to PopWorkerStatus with a description.
- Updated function signatures for StartProcess, PopWorkerCallbackInternal, and PopWorkerCallbackAsync to include new error handling parameters.
src/ray/raylet/worker_pool_test.cc
- Mock functions updated to match new parameter lists for PopWorkerCallbackAsync and StartProcess.
src/ray/util/process.cc
- Introduced ReadBytesFromFd helper for robust file descriptor reading.
- Major refactoring of ProcessFD::Spawn to use a status_pipe for child process PID and error communication.
- Improved error handling for fork and pipe system calls.
- Revised logic for decoupled processes to correctly retrieve grandchild PIDs and propagate execvpe errors.

Activity

The pull request author, mjacar, initiated this "Work In Progress" PR to fix an E2BIG problem.
gemini-code-assist[bot] has provided extensive automated review feedback, identifying numerous issues.
Critical Issues Identified:
- Regression in spawnvpe for decoupled processes, leading to incorrect PID, broken lifetime tracking, and race conditions in error reporting (Comment 2223710079).
- Regression in pipe_to_stdin logic in spawnvpe (missing dup2 call) (Comment 2223762118).
- Incorrect RAY_CHECK in PopWorkerCallbackAsync that would cause crashes (Comment 2223771630).
- Initial implementation of StartProcess had ec checked before assignment, missing call, and extraneous brace (Comment 2223839327).
- Potential resource leak if pipe fails in ProcessFD::Spawn (Comment 2223830064).
- Potential correctness issue with partial reads for err_from_child in non-decoupled processes, leading to undefined behavior (Comment 2225689234).
High Priority Issues Identified:
- Loss of original errno for decoupled processes, reporting ECHILD instead of the actual error (Comment 2223784017).
- Ignored write return value in child process when reporting execvpe error (Comment 2223640494).
- Suggested logging error and returning failure status instead of FATAL for other errors in StartWorkerProcess (Comment 2223852325).
Medium Priority Issues Identified:
- Numerous indentation inconsistencies (tabs vs. spaces) across multiple files (worker_pool.cc, worker_pool.h).
- Suggestions for more robust read checks, simplifying variable initialization, and using std::move.
- Recommendations for adding more specific log messages and updating comments for clarity.
- Concerns about ignored waitpid return values and potential zombie processes, with detailed explanations provided by the bot (Comment 2223640529).
mjacar has been actively engaging with the bot's feedback, asking for specific code suggestions and clarifications on the identified issues.
The PR is still marked as "WIP" and has several critical and high-priority issues that need to be addressed.

mjacar · 2025-07-23T17:10:03Z

@edoakes @MortalHappiness You guys seem to be the last people to touch process.cc so maybe you would at least know who to forward this to for code review.

jjyao · 2025-07-23T19:28:21Z

@mjacar thanks for the very detailed write-up and the contribution!!

edoakes

Looks great overall, thanks for the contribution.

Could you try adding a Python integration test for this? It can look similar to the reproduction script you provided. I'd suggest adding it to

ray/python/ray/tests/test_runtime_env.py

Line 4 in c006d81

put the test in `test_runtime_env_standalone.py`.

src/ray/raylet/local_task_manager.cc

src/ray/raylet/worker_pool.cc

src/ray/raylet/worker_pool.h

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com>

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

mjacar · 2025-07-23T23:23:34Z

@edoakes Thanks for taking a look. I added a test in test_runtime_env.py like you asked.

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/util/process.cc Outdated Show resolved Hide resolved

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

src/ray/raylet/local_task_manager.cc Outdated Show resolved Hide resolved

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/util/process.cc Outdated Show resolved Hide resolved

src/ray/util/process.cc Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/util/process.cc Outdated Show resolved Hide resolved

src/ray/raylet/local_task_manager.cc Outdated Show resolved Hide resolved

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

src/ray/raylet/local_task_manager.cc Outdated Show resolved Hide resolved

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

src/ray/util/process.cc Outdated Show resolved Hide resolved

src/ray/raylet/local_task_manager.cc Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 22, 2025

View reviewed changes

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved

src/ray/util/process.cc Outdated Show resolved Hide resolved

mjacar force-pushed the fix-e2big-problem branch 2 times, most recently from b399d15 to 7fa2d59 Compare July 23, 2025 03:00

mjacar and others added 17 commits July 23, 2025 05:59

Fix E2BIG problem

e2a335f

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/util/process.cc

dbd6813

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/raylet/worker_pool.cc

c329b6f

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/raylet/worker_pool.cc

e26f6fe

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Remove debugging line

41a23f1

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Add more logging for potential zombie processes

7c986cf

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

[core] combine python protobuf compiles rules (ray-project#54810)

221abf0

just use serve and core; stop pretending that we have separate packages. removes all unused python compile rules Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

[core] Skip test_object_assign_owner in client mode (ray-project#54826

15dfd9c

) Experimental feature we're in the process of deprecating; skip the flaky test. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/raylet/worker_pool.cc

1474511

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/raylet/worker_pool.h

680192b

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/raylet/worker_pool.h

34b6a70

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/raylet/worker_pool.h

ec669af

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Try hybrid approach

c088f34

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Add dup2 call

5aa46e8

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

mjacar and others added 11 commits July 23, 2025 05:59

get rid of extra brace

cbe635f

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Update src/ray/util/process.cc

f363ad0

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

More robust reads

87e8070

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Linter

e01f3ec

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

[serve.llm] Skip batching logic if batch_interval_ms == 0 (ray-proj…

587a56b

…ect#54751) Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

[dashboard] fix typos (ray-project#54550)

d85e0ad

Signed-off-by: Jan Hilberath <jan@hilberath.de> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Potential fixes

b6fc13c

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Potential fix

1053dfc

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Potential fix

e424d43

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

More robust

d09aed0

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

mjacar force-pushed the fix-e2big-problem branch from b37318f to d09aed0 Compare July 23, 2025 05:59

mjacar added 2 commits July 23, 2025 02:00

Merge branch 'master' into fix-e2big-problem

76d8c00

Merge branch 'master' into fix-e2big-problem

a1530f5

gemini-code-assist bot reviewed Jul 23, 2025

View reviewed changes

src/ray/util/process.cc Show resolved Hide resolved

src/ray/raylet/local_task_manager.cc Show resolved Hide resolved

mjacar marked this pull request as ready for review July 23, 2025 17:06

mjacar requested a review from a team as a code owner July 23, 2025 17:06

mjacar changed the title ~~WIP: Fix E2BIG problem~~ [core][bug] Fix infinite loop on E2BIG error when spawning a process Jul 23, 2025

ray-gardener bot added community-contribution Contributed by the community core Issues that should be addressed in Ray Core labels Jul 23, 2025

edoakes reviewed Jul 23, 2025

View reviewed changes

mjacar and others added 2 commits July 23, 2025 18:06

Update src/ray/raylet/local_task_manager.cc

d786a88

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Michael Acar <michaeljacar@protonmail.com>

Code review

131b55a

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

mjacar force-pushed the fix-e2big-problem branch from 7f19859 to 131b55a Compare July 23, 2025 23:20

mjacar and others added 2 commits July 24, 2025 03:56

Comments

d038bea

Signed-off-by: Michael Acar <michael.j.acar@gmail.com>

Merge branch 'master' into fix-e2big-problem

9652d44

[core][bug] Fix infinite loop on E2BIG error when spawning a process #54838

Are you sure you want to change the base?

[core][bug] Fix infinite loop on E2BIG error when spawning a process #54838

Conversation

mjacar commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Summary

Reproduction Script

Detailed Write-Up

The process.cc Bug: A Startup Race Condition

The Fix: A Robust Communication Protocol

Related issue number

Checks

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mjacar commented Jul 22, 2025 •

edited

Loading

The `process.cc` Bug: A Startup Race Condition