Skip to content

Conversation

@clumsy
Copy link
Collaborator

@clumsy clumsy commented Apr 23, 2025

A simple merge for the list of all registered schedulers.

Test plan:
[x] all existing tests should pass

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2025
@kiukchung
Copy link
Contributor

could you provide more context to what you want to achieve with this?

@clumsy
Copy link
Collaborator Author

clumsy commented Apr 23, 2025

All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.

@kiukchung
Copy link
Contributor

Hi @clumsy thanks for the pointer. torchx.schedulers.get_scheduler_factories() is a public API and this change is not backwards-compatible for the case get_scheduler_factories(skip_defaults=False) where there exists registered entrypoint schedulers. Now users will get their configured + default schedulers instead of just their configured ones.

Could you describe your use-case in wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.

@clumsy
Copy link
Collaborator Author

clumsy commented May 1, 2025

Sure, @kiukchung

Take NeMo for example, NVidia ships it with all dependencies, including nemo-run (https://github.com/NVIDIA/NeMo/blob/94589bde88fab1997c842be4e000faf69180cffb/nemo/collections/common/parts/nemo_run_utils.py#L18)

Unfortunately nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48

Thus we cannot use local_cwd from within the container for example, or if we have nemo-run installed.

It makes sense to have a feature to restrict the available schedulers, but does it have to be the default one?

@kiukchung
Copy link
Contributor

kiukchung commented May 1, 2025

@clumsy ah that's an interesting edge-case. What you basically want is for torchx.schedulers to be additive. We don't treat it as such today. Since Python entrypoint groups don't compound, we have to come up with a convention for the group names.

One thing to note about DEFAULT_SCHEDULER_MODULES is that TorchX treats them as the default if you haven't registered your own (akin to map.get("key", default="DEFAULT_VAL")) rather than "generally useful ones" that get added regardless of whether you have your own registrations. This is generally the case for all the TorchX configurations exposed as entrypoints (see: https://pytorch.org/torchx/latest/advanced.html)

We could do something like: {org_name}.torchx.schedulers and at load time select *.torchx.schedulers entrypoint groups. For BC we'd also have to keep reading torchx.schedulers.

If you're open to it, you can add support for prefixes in torch.util.entrypoints.load() (https://github.com/pytorch/torchx/blob/9120355959fef8eb438226e9521a07a00e02078e/torchx/util/entrypoints.py#L54)

and make a change in torchx.schedulers.get_schedulers() to call the load() fn appropriately.

There's some interesting cases regarding name conflicts and ordering (e.g. if nemo registers a scheduler with the same name as the one somewhere else what do you do?)

@clumsy
Copy link
Collaborator Author

clumsy commented May 13, 2025

Thanks, @kiukchung! The proposed solution works for me and I can provide the implementation shortly.

clumsy pushed a commit to clumsy/torchx that referenced this pull request May 13, 2025
@clumsy
Copy link
Collaborator Author

clumsy commented May 13, 2025

Please check this implementation, @kiukchung , @andywag, @d4l3k, @tonykao8080

The existing commands continue to work, e.g.:

  • torchx runopts
my_package.custom_local_docker:
    usage:
        [copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[debug=DEBUG],[image_repo=IMAGE_REPO],[quiet=QUIET]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        env=ENV (typing.Dict[str, str], None)
            environment variables to be passed to the run. The separator sign can be eiher comma or semicolon
            (e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
            of the ones from copy_env
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        debug=DEBUG (bool, False)
            run a container with noop entrypoint, useful for debugging environment
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.
  • torchx run -s my_package.custom_local_cwd
usage: torchx run [-h]
                  [-s {local_docker,local_cwd,slurm,kubernetes,kubernetes_mcad,aws_batch,aws_sagemaker,gcp_batch,ray,lsf,my_package.custom_aws_batch,my_package.aws_sagemaker,my_package.custom_gcp_batch,my_package.kubernetes,my_package.kubernetes_mcad,my_package.custom_local_cwd,my_package.local_docker,my_package.lsf,my_package.custom_aws_batch,my_package.custom_local_docker,}]
  • torchx run -s my_package.custom_local_cwd utils.echo --msg "test"
torchx 2025-05-13 12:52:15 INFO     Tracker configurations: {}
torchx 2025-05-13 12:52:15 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-05-13 12:52:15 INFO     Log directory is: /var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg
my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     Launched app: my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles:
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg/torchx/echo-gww0rzcc0z72tc

@facebook-github-bot
Copy link
Contributor

@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@clumsy
Copy link
Collaborator Author

clumsy commented Jun 6, 2025

The failure seems unrelated to the PR: E RuntimeError: operator torchvision::nms does not exist
@kiukchung Should we rerun the tests?

@clumsy
Copy link
Collaborator Author

clumsy commented Jun 18, 2025

Hi @kiukchung, should I do a bogus push to trigger new test run? The last failure seem to have been unrelated.

@clumsy
Copy link
Collaborator Author

clumsy commented Aug 22, 2025

Hi @kiukchung does this PR need more work or is there anything I can do to help integrate the change?

@clumsy
Copy link
Collaborator Author

clumsy commented Oct 3, 2025

Sorry for bugging, but could you please let me know if I can help getting this one merged, @kiukchung , @d4l3k , @tonykao8080 , @andywag ? Thanks!

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 57ec248 to 8f0c6b0 Compare October 24, 2025 23:18
@clumsy
Copy link
Collaborator Author

clumsy commented Oct 24, 2025

I just realized this was never merged. Does it need more work, @kiukchung?

@codecov-commenter
Copy link

codecov-commenter commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.61%. Comparing base (7abb9a2) to head (05924eb).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1050   +/-   ##
=======================================
  Coverage   91.61%   91.61%           
=======================================
  Files          84       84           
  Lines        6583     6583           
=======================================
  Hits         6031     6031           
  Misses        552      552           
Flag Coverage Δ
unittests 91.61% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 8f0c6b0 to 36e77f7 Compare October 25, 2025 01:26
@clumsy
Copy link
Collaborator Author

clumsy commented Oct 25, 2025

Hm, not sure why The Diff and Pull Request are not in sync! check seem to be failing here, yet All checks passed, this Pull Request can be imported!, @kiukchung

I wonder if it's related to the new permissions. I rebased on top of upstream main and pushed to my fork like I always do :/
Perhaps it's because it's not approved?

@clumsy
Copy link
Collaborator Author

clumsy commented Oct 27, 2025

Is there anything I can do here, @kiukchung?

@clumsy
Copy link
Collaborator Author

clumsy commented Oct 29, 2025

Please have a look when you get a chance, @kiukchung
Thanks!

@kiukchung
Copy link
Contributor

Please have a look when you get a chance, @kiukchung Thanks!

@clumsy sorry I didn't get to this sooner! Will take a closer look tomorrow morning. Just want to make sure everything is BC!

@clumsy
Copy link
Collaborator Author

clumsy commented Oct 30, 2025

Appreciate your help, @kiukchung!

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from e9891d2 to cfa412b Compare October 30, 2025 19:28
@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from cfa412b to 05924eb Compare October 30, 2025 19:31
@clumsy
Copy link
Collaborator Author

clumsy commented Oct 30, 2025

Fixed the BC issues in the new revision and added more unit tests @kiukchung

Comment on lines +51 to +56
schedulers = load_group(group, default={})
if not skip_defaults:
for name, path in DEFAULT_SCHEDULER_MODULES.items():
if name not in schedulers:
schedulers[name] = _defer_load_scheduler(path)
return schedulers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for addressing one of the BC incompatibilities. But I think the other one still remains. get_scheduler_factories(skip_defaults=False) now returns BOTH my custom [torchx.schedulers] mappings AND the DEFAULT_SCHEDULER_MODULES which is not ideal for users who expect ONLY my custom schedulers to show. This is important since the default schedulers may not even be available for the users to use.

Unlike named_resources, scheduler_factories is not additive. We should've kept the same behavior in both (either additive or not).

skip_defaults argument in get_scheduler_factories() means "even if you don't find anything in entry-points, do not return schedulers in DEFAULT_SCHEDULER_MODULES. I think the confusion is that DEFAULT_SCHEDULER_MODULES is a badly worded constant. What it "should've been called are BUILTIN_SCHEDULER_MODULES (aka the ones that torchx has "builtin" support for) and skip_defaults should've been called skip_builtins.

If the motivation here is for the CLI to expose custom + builtins such that builtins is always in-sync with any new schedulers added to torchx, we are currently in a mode where we are retiring (not adding) schedulers. The goal being better support/integration rather than wider coverage.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this PR is to be able to allow registering custom schedulers next to built-in ones. E.g. if NeMo wants to register some new ones or even overwrite some of them - the rest should be available, e.g. local_cwd. It's not practical to ask the user to adding all built-ins manually - we are not even sure in what order these will be resolved, e.g. both NeMo and some other package registering their own. The simplest idea was to allow additive behavior if we are not using the same names. I guess we can preserve the existing "clear+add" approach as well @kiukchung

@meta-codesync
Copy link

meta-codesync bot commented Oct 31, 2025

@kiukchung has imported this pull request. If you are a Meta employee, you can view this in D74743585.

Copy link
Contributor

@tonykao8080 tonykao8080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review automatically exported from Phabricator review in Meta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants