feat: list all registered schedulers (#1009) #1050

clumsy · 2025-04-23T15:26:30Z

A simple merge for the list of all registered schedulers.

Test plan:
[x] all existing tests should pass

kiukchung · 2025-04-23T20:00:12Z

could you provide more context to what you want to achieve with this?

clumsy · 2025-04-23T20:33:30Z

All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.

kiukchung · 2025-04-30T17:58:40Z

Hi @clumsy thanks for the pointer. torchx.schedulers.get_scheduler_factories() is a public API and this change is not backwards-compatible for the case get_scheduler_factories(skip_defaults=False) where there exists registered entrypoint schedulers. Now users will get their configured + default schedulers instead of just their configured ones.

Could you describe your use-case in wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.

clumsy · 2025-05-01T12:26:35Z

Sure, @kiukchung

Take NeMo for example, NVidia ships it with all dependencies, including nemo-run (https://github.com/NVIDIA/NeMo/blob/94589bde88fab1997c842be4e000faf69180cffb/nemo/collections/common/parts/nemo_run_utils.py#L18)

Unfortunately nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48

Thus we cannot use local_cwd from within the container for example, or if we have nemo-run installed.

It makes sense to have a feature to restrict the available schedulers, but does it have to be the default one?

kiukchung · 2025-05-01T19:29:47Z

@clumsy ah that's an interesting edge-case. What you basically want is for torchx.schedulers to be additive. We don't treat it as such today. Since Python entrypoint groups don't compound, we have to come up with a convention for the group names.

One thing to note about DEFAULT_SCHEDULER_MODULES is that TorchX treats them as the default if you haven't registered your own (akin to map.get("key", default="DEFAULT_VAL")) rather than "generally useful ones" that get added regardless of whether you have your own registrations. This is generally the case for all the TorchX configurations exposed as entrypoints (see: https://pytorch.org/torchx/latest/advanced.html)

We could do something like: {org_name}.torchx.schedulers and at load time select *.torchx.schedulers entrypoint groups. For BC we'd also have to keep reading torchx.schedulers.

If you're open to it, you can add support for prefixes in torch.util.entrypoints.load() (https://github.com/pytorch/torchx/blob/9120355959fef8eb438226e9521a07a00e02078e/torchx/util/entrypoints.py#L54)

and make a change in torchx.schedulers.get_schedulers() to call the load() fn appropriately.

There's some interesting cases regarding name conflicts and ordering (e.g. if nemo registers a scheduler with the same name as the one somewhere else what do you do?)

clumsy · 2025-05-13T16:48:43Z

Thanks, @kiukchung! The proposed solution works for me and I can provide the implementation shortly.

clumsy · 2025-05-13T17:14:29Z

Please check this implementation, @kiukchung , @andywag, @d4l3k, @tonykao8080

The existing commands continue to work, e.g.:

torchx runopts

my_package.custom_local_docker:
    usage:
        [copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[debug=DEBUG],[image_repo=IMAGE_REPO],[quiet=QUIET]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        env=ENV (typing.Dict[str, str], None)
            environment variables to be passed to the run. The separator sign can be eiher comma or semicolon
            (e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
            of the ones from copy_env
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        debug=DEBUG (bool, False)
            run a container with noop entrypoint, useful for debugging environment
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

torchx run -s my_package.custom_local_cwd

usage: torchx run [-h]
                  [-s {local_docker,local_cwd,slurm,kubernetes,kubernetes_mcad,aws_batch,aws_sagemaker,gcp_batch,ray,lsf,my_package.custom_aws_batch,my_package.aws_sagemaker,my_package.custom_gcp_batch,my_package.kubernetes,my_package.kubernetes_mcad,my_package.custom_local_cwd,my_package.local_docker,my_package.lsf,my_package.custom_aws_batch,my_package.custom_local_docker,}]

torchx run -s my_package.custom_local_cwd utils.echo --msg "test"

torchx 2025-05-13 12:52:15 INFO     Tracker configurations: {}
torchx 2025-05-13 12:52:15 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-05-13 12:52:15 INFO     Log directory is: /var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg
my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     Launched app: my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles:
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg/torchx/echo-gww0rzcc0z72tc

facebook-github-bot · 2025-05-14T17:41:28Z

@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

clumsy · 2025-06-06T18:41:51Z

The failure seems unrelated to the PR: E RuntimeError: operator torchvision::nms does not exist
@kiukchung Should we rerun the tests?

clumsy · 2025-06-18T22:35:54Z

Hi @kiukchung, should I do a bogus push to trigger new test run? The last failure seem to have been unrelated.

clumsy · 2025-08-22T12:47:02Z

Hi @kiukchung does this PR need more work or is there anything I can do to help integrate the change?

clumsy · 2025-10-03T19:27:41Z

Sorry for bugging, but could you please let me know if I can help getting this one merged, @kiukchung , @d4l3k , @tonykao8080 , @andywag ? Thanks!

clumsy · 2025-10-24T23:19:38Z

I just realized this was never merged. Does it need more work, @kiukchung?

codecov-commenter · 2025-10-24T23:21:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.61%. Comparing base (7abb9a2) to head (05924eb).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1050   +/-   ##
=======================================
  Coverage   91.61%   91.61%           
=======================================
  Files          84       84           
  Lines        6583     6583           
=======================================
  Hits         6031     6031           
  Misses        552      552

Flag	Coverage Δ
unittests	`91.61% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

clumsy · 2025-10-25T01:30:30Z

Hm, not sure why The Diff and Pull Request are not in sync! check seem to be failing here, yet All checks passed, this Pull Request can be imported!, @kiukchung

I wonder if it's related to the new permissions. I rebased on top of upstream main and pushed to my fork like I always do :/
Perhaps it's because it's not approved?

clumsy · 2025-10-27T18:28:19Z

Is there anything I can do here, @kiukchung?

clumsy · 2025-10-29T16:42:41Z

Please have a look when you get a chance, @kiukchung
Thanks!

kiukchung · 2025-10-30T01:39:08Z

Please have a look when you get a chance, @kiukchung Thanks!

@clumsy sorry I didn't get to this sooner! Will take a closer look tomorrow morning. Just want to make sure everything is BC!

clumsy · 2025-10-30T12:34:17Z

Appreciate your help, @kiukchung!

torchx/schedulers/__init__.py

clumsy · 2025-10-30T19:31:17Z

Fixed the BC issues in the new revision and added more unit tests @kiukchung

kiukchung · 2025-10-31T17:07:39Z

torchx/schedulers/__init__.py

+    schedulers = load_group(group, default={})
+    if not skip_defaults:
+        for name, path in DEFAULT_SCHEDULER_MODULES.items():
+            if name not in schedulers:
+                schedulers[name] = _defer_load_scheduler(path)
+    return schedulers


thanks for addressing one of the BC incompatibilities. But I think the other one still remains. get_scheduler_factories(skip_defaults=False) now returns BOTH my custom [torchx.schedulers] mappings AND the DEFAULT_SCHEDULER_MODULES which is not ideal for users who expect ONLY my custom schedulers to show. This is important since the default schedulers may not even be available for the users to use.

Unlike named_resources, scheduler_factories is not additive. We should've kept the same behavior in both (either additive or not).

skip_defaults argument in get_scheduler_factories() means "even if you don't find anything in entry-points, do not return schedulers in DEFAULT_SCHEDULER_MODULES. I think the confusion is that DEFAULT_SCHEDULER_MODULES is a badly worded constant. What it "should've been called are BUILTIN_SCHEDULER_MODULES (aka the ones that torchx has "builtin" support for) and skip_defaults should've been called skip_builtins.

If the motivation here is for the CLI to expose custom + builtins such that builtins is always in-sync with any new schedulers added to torchx, we are currently in a mode where we are retiring (not adding) schedulers. The goal being better support/integration rather than wider coverage.

The goal of this PR is to be able to allow registering custom schedulers next to built-in ones. E.g. if NeMo wants to register some new ones or even overwrite some of them - the rest should be available, e.g. local_cwd. It's not practical to ask the user to adding all built-ins manually - we are not even sure in what order these will be resolved, e.g. both NeMo and some other package registering their own. The simplest idea was to allow additive behavior if we are not using the same names. I guess we can preserve the existing "clear+add" approach as well @kiukchung

meta-codesync · 2025-10-31T17:08:57Z

@kiukchung has imported this pull request. If you are a Meta employee, you can view this in D74743585.

tonykao8080

Review automatically exported from Phabricator review in Meta.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2025

clumsy pushed a commit to clumsy/torchx that referenced this pull request May 13, 2025

feat: allow registering entry points with prefixes (meta-pytorch#1050)

1d05766

clumsy force-pushed the feat/extend_schedulers_list branch from 57ec248 to 8f0c6b0 Compare October 24, 2025 23:18

clumsy force-pushed the feat/extend_schedulers_list branch from 8f0c6b0 to 36e77f7 Compare October 25, 2025 01:26

kiukchung reviewed Oct 30, 2025

View reviewed changes

torchx/schedulers/__init__.py Outdated Show resolved Hide resolved

torchx/schedulers/__init__.py Outdated Show resolved Hide resolved

clumsy force-pushed the feat/extend_schedulers_list branch from e9891d2 to cfa412b Compare October 30, 2025 19:28

feat: list all registered schedulers (meta-pytorch#1009)

05924eb

clumsy force-pushed the feat/extend_schedulers_list branch from cfa412b to 05924eb Compare October 30, 2025 19:31

kiukchung reviewed Oct 31, 2025

View reviewed changes

tonykao8080 approved these changes Nov 4, 2025

View reviewed changes

feat: list all registered schedulers (#1009) #1050

Are you sure you want to change the base?

feat: list all registered schedulers (#1009) #1050

Uh oh!

Conversation

clumsy commented Apr 23, 2025

Uh oh!

kiukchung commented Apr 23, 2025

Uh oh!

clumsy commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kiukchung commented Apr 30, 2025

Uh oh!

clumsy commented May 1, 2025

Uh oh!

kiukchung commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clumsy commented May 13, 2025

Uh oh!

clumsy commented May 13, 2025

Uh oh!

facebook-github-bot commented May 14, 2025

Uh oh!

clumsy commented Jun 6, 2025

Uh oh!

clumsy commented Jun 18, 2025

Uh oh!

clumsy commented Aug 22, 2025

Uh oh!

clumsy commented Oct 3, 2025

Uh oh!

clumsy commented Oct 24, 2025

Uh oh!

codecov-commenter commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

clumsy commented Oct 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clumsy commented Oct 27, 2025

Uh oh!

clumsy commented Oct 29, 2025

Uh oh!

kiukchung commented Oct 30, 2025

Uh oh!

clumsy commented Oct 30, 2025

Uh oh!

Uh oh!

Uh oh!

clumsy commented Oct 30, 2025

Uh oh!

kiukchung Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

clumsy Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

meta-codesync bot commented Oct 31, 2025

Uh oh!

tonykao8080 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

clumsy commented Apr 23, 2025 •

edited

Loading

kiukchung commented May 1, 2025 •

edited

Loading

codecov-commenter commented Oct 24, 2025 •

edited

Loading

clumsy commented Oct 25, 2025 •

edited

Loading