@zhuhan0 zhuhan0 commented Aug 15, 2025

Summary:
CUDA context initialization is not fork-safe. If a CUDA context is created in a parent process, and then the process is forked (using `os.fork()`), the child process may encounter errors or undefined behavior when using CUDA. This is because the CUDA driver and runtime are not designed to be safely duplicated via `fork()`. It's recommended to use `spawn` or `forkserver`.

Of the two, `forkserver` needs to be used carefully: specifically, it's recommended to call `multiprocessing.set_start_method('forkserver')` at the very start of the program, and the parent process also needs to avoid initializing the CUDA context. When upgrading APS to CUDA 12.8, we encountered a test failure. The test apparently initializes the CUDA context before starting up two child processes, and I suspect that caused the test to hang (see the [post](https://fb.workplace.com/groups/319878845696681/posts/1494595861558301)).
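As a minimal sketch of that ordering constraint (the function and variable names here are illustrative, not from the actual test):

```python
import multiprocessing as mp

def child_main() -> None:
    # Any CUDA context should be created here, inside the child,
    # never in the parent before the children are started.
    print("child running")

if __name__ == "__main__":
    # This must be the first multiprocessing-related call in the
    # program, before the parent touches CUDA in any way. It raises
    # RuntimeError if a start method was already set.
    # (forkserver is Unix-only; it is unavailable on Windows.)
    mp.set_start_method("forkserver")
    p = mp.Process(target=child_main)
    p.start()
    p.join()
```

If the parent has already initialized CUDA by the time `set_start_method` runs, this pattern no longer helps, which is exactly the situation described above.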

It's hard to avoid initializing the CUDA context early in this test, because the GPU count is checked in the test method's decorator ([code](https://fburl.com/code/27naz2eg)). Between the `spawn` and `forkserver` start methods, `spawn` is less efficient but the most robust. Let's switch to `spawn` instead to avoid any potential undefined behavior with CUDA 12.8 and multiprocessing.
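A hedged sketch of the switch, using `multiprocessing.get_context("spawn")` so the choice is local to the test rather than process-global (the worker below is a stand-in, not the real torchrec test body):

```python
import multiprocessing as mp

def gpu_worker(rank: int, queue) -> None:
    # In the real test this child would initialize CUDA; a spawned
    # child starts from a fresh interpreter, so it does not inherit
    # any CUDA context the parent may already have created.
    queue.put(rank)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=gpu_worker, args=(rank, queue)) for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    results = sorted(queue.get() for _ in range(2))
    print(results)  # [0, 1]
```

Using a context object avoids the `set_start_method` ordering constraint entirely, at the cost of spawn's slower child startup (each child re-imports the main module).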

Differential Revision: D80305233

@meta-cla meta-cla bot added the CLA Signed label Aug 15, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80305233

zhuhan0 added a commit to zhuhan0/torchrec that referenced this pull request Aug 15, 2025
Reviewed By: adamomainz, weifengpy

Differential Revision: D80305233
zhuhan0 added a commit to zhuhan0/torchrec that referenced this pull request Aug 15, 2025
Pull Request resolved: meta-pytorch#3284
@zhuhan0 zhuhan0 force-pushed the export-D80305233 branch 2 times, most recently from c9a654b to 528871e on August 15, 2025 22:09
zhuhan0 added a commit to zhuhan0/torchrec that referenced this pull request Aug 15, 2025