Skip to content

Conversation

andywag
Copy link
Contributor

@andywag andywag commented Apr 28, 2025

Summary:
TorchX has been handling ModuleNotFoundError gracefully for a while now, e.g. for SageMaker when running torchx runopts we get:

...
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

aws_sagemaker: No module named 'sagemaker'

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]
...

But for ray we get an exception after which we won't get next runopts:

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
    opts = runner.scheduler_run_opts(scheduler)
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
    return self._scheduler(scheduler).run_opts()
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
    sched = factory(self._name, **self._scheduler_params)
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
    module = importlib.import_module(path)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
    session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined

That's because ray_scheduler has custom ModuleNotFoundException handling - perhaps for historic reasons.

Test Plan: [x] existing test must pass

Differential Revision: D73751531

Pulled By: andywag

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 28, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D73751531

Summary:

TorchX has been handling `ModuleNotFoundError` gracefully for a while now, e.g. for SageMaker when running `torchx runopts` we get:

```
...
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

aws_sagemaker: No module named 'sagemaker'

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]
...
```

But for `ray` we get an exception after which we won't get next runopts:
```
gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
    opts = runner.scheduler_run_opts(scheduler)
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
    return self._scheduler(scheduler).run_opts()
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
    sched = factory(self._name, **self._scheduler_params)
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
    module = importlib.import_module(path)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
    session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined
```

That's because `ray_scheduler` has custom `ModuleNotFoundException` handling - perhaps for historic reasons.


Test Plan: [x] existing test must pass

Reviewed By: tonykao8080

Differential Revision: D73751531

Pulled By: andywag
@andywag andywag force-pushed the export-D73751531 branch from 5526463 to c489694 Compare May 1, 2025 04:29
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D73751531

@andywag andywag closed this May 1, 2025
@andywag andywag reopened this May 1, 2025
@facebook-github-bot facebook-github-bot merged commit 9120355 into meta-pytorch:main May 1, 2025
40 of 43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants