kiukchung
Contributor

Summary:
`compute_world_size` runs as an integration test in a `-j 2x2` configuration under `torchrun`, which sets `MASTER_ADDR` and `MASTER_PORT`. However, it ignored those env vars and overrode them with the values from the hydra config (added so that `compute_world_size` can run as a single process without `torchrun`).

Integration tests are failing in CI because all 4 workers use `localhost:0` (port 0 means "pick a random free port") as the `MASTER_ADDR:MASTER_PORT`, so all 4 workers deadlock waiting for each other to join the job.
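To illustrate why port 0 cannot serve as a shared rendezvous port, here is a minimal sketch using only the standard library. The helper name `pick_two_free_ports` is hypothetical and is not part of this PR. It shows only that two independent binds to port 0 are assigned different ports by the OS, so workers that each resolve port 0 on their own have no common endpoint to meet on.

```python
import socket
from typing import Tuple


def pick_two_free_ports(host: str = "127.0.0.1") -> Tuple[int, int]:
    """Bind two sockets to port 0 and return the ports the OS assigned.

    Hypothetical helper for illustration only. Both sockets stay open until
    both ports are read, so the OS cannot hand out the same port twice.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as a, \
         socket.socket(socket.AF_INET, socket.SOCK_STREAM) as b:
        a.bind((host, 0))  # port 0: let the OS choose a free port
        b.bind((host, 0))
        return a.getsockname()[1], b.getsockname()[1]


if __name__ == "__main__":
    port_a, port_b = pick_two_free_ports()
    print(f"first bind got {port_a}, second bind got {port_b}")
```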

This diff fixes the issue by setting the env vars from the hydra config only when they are not already set.
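The "only set if not already set" behavior could look roughly like the sketch below. It is not the actual diff: the function name `apply_rendezvous_defaults`, the shape of the `defaults` mapping, and the per-variable fallback are all assumptions made for illustration.

```python
import os
from typing import Mapping, MutableMapping, Optional


def apply_rendezvous_defaults(
    defaults: Mapping[str, object],
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Fill in MASTER_ADDR / MASTER_PORT only where the env lacks them.

    Hypothetical helper: `defaults` stands in for the values read from the
    hydra config. Values already present (e.g. set by torchrun) are kept.
    """
    if env is None:
        env = os.environ
    for key in ("MASTER_ADDR", "MASTER_PORT"):
        # setdefault leaves an existing value untouched
        env.setdefault(key, str(defaults[key]))
    return env


if __name__ == "__main__":
    config_defaults = {"MASTER_ADDR": "localhost", "MASTER_PORT": 0}

    # Launched under torchrun: the launcher's values are preserved.
    launched = {"MASTER_ADDR": "node0", "MASTER_PORT": "29500"}
    print(apply_rendezvous_defaults(config_defaults, launched))

    # Single process without torchrun: the config values are used.
    print(apply_rendezvous_defaults(config_defaults, {}))
```

Passing a plain `dict` for `env` keeps the sketch testable without mutating the real process environment.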

Differential Revision: D71919903

@facebook-github-bot added the CLA Signed label Mar 26, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D71919903

…ot respecting env vars set by torchrun (#1029)


Contributor

@tonykao8080 left a comment

looks good

@facebook-github-bot merged commit 01f987d into main Mar 26, 2025
21 of 23 checks passed
@facebook-github-bot deleted the export-D71919903 branch March 26, 2025 23:15
Labels

CLA Signed, fb-exported
