Conversation

mikeshng
Contributor

  • One-line PR description: Add a new KEP to introduce the Placement Decision API for multicluster scheduling
  • Other comments:

/sig multicluster

@k8s-ci-robot k8s-ci-robot added sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 17, 2025
@k8s-ci-robot k8s-ci-robot requested review from JeremyOT and skitt May 17, 2025 23:48
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 17, 2025
@k8s-ci-robot
Contributor

Hi @mikeshng. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 17, 2025
@mikeshng
Contributor Author

@k8s-ci-robot
Contributor

@mikeshng: GitHub didn't allow me to assign the following users: zhiying-lin.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @deads2k @RainbowMango @zhiying-lin

CC @corentone @elgnay @haoqing0110 @jnpacker @qiujian16 @ryanzhang-oss

@iholder101
Contributor

/cc @awels
FYI

@k8s-ci-robot
Contributor

@iholder101: GitHub didn't allow me to request PR reviews from the following users: awels.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @awels
FYI

@mikeshng mikeshng force-pushed the placement-decision-api branch from 9ca10ab to 3406d3d Compare May 19, 2025 16:02
Contributor

@corentone corentone left a comment

Trying to simplify it a bit.

At the same time, will try to suggest sharing our MCO one as the placement.

@mikeshng mikeshng force-pushed the placement-decision-api branch from 3406d3d to 881956f Compare May 25, 2025 15:34
@mikeshng mikeshng force-pushed the placement-decision-api branch 4 times, most recently from 280c3b3 to 971facb Compare May 27, 2025 21:56
@mikeshng
Contributor Author

mikeshng commented Jun 3, 2025

Closed threads I believe are resolved. Feel free to reopen or comment if there's more to discuss. Thanks!

@lauralorenz
Contributor

Triage notes: Now waiting for @skitt and @JeremyOT feedback as recent community comments have been addressed and we feel it is ready for you all to look at!

@mikeshng mikeshng force-pushed the placement-decision-api branch from 971facb to f04fa5a Compare June 3, 2025 22:51
@skitt
Member

skitt commented Jun 16, 2025

This looks good to me. There are a few typos etc. but we can take care of that later (I’ll follow up). I take it we’ll revisit the graduation criteria, test plans etc. after the initial merge, is that right?

/lgtm

@JeremyOT ping

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2025
@skitt
Member

skitt commented Jun 16, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 16, 2025
@mikeshng mikeshng force-pushed the placement-decision-api branch from f04fa5a to 272a8e6 Compare June 16, 2025 17:24
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mikeshng
Once this PR has been reviewed and has the lgtm label, please ask for approval from skitt. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mikeshng
Contributor Author

This looks good to me. There are a few typos etc. but we can take care of that later (I’ll follow up). I take it we’ll revisit the graduation criteria, test plans etc. after the initial merge, is that right?

Right, thanks @skitt !

Just pushed the update-toc check fix.


### Terminology

- **Placement**: A scheduler request that asks "where should this workload run?".
Member

Don't we need to formalize what a Placement request is before it makes sense to make decisions implementation agnostic?

As is, it seems like you can't really swap out the consumer because it needs to know what the Placement meant to the scheduler references.

Contributor Author

Updated the KEP to acknowledge this limitation.
While consumers may still need scheduler-specific knowledge for complex scenarios, this API provides value by standardizing the "where" (cluster list) output. That enables basic workload distribution and reduces integration complexity, even without full standardization of the placement request.
WDYT Jeremy?

Contributor

@ryanzhang-oss ryanzhang-oss Aug 12, 2025

Seconding Mike. The value of this API is that it provides a single interface for any project with a multi-cluster component to outsource the scheduling part. For example, Argo CD's ApplicationSet can take a cluster generator, and MultiKueue now supports external scheduling. One thing those projects have in common is that they all assume there is an external scheduling component instead of trying to do the scheduling themselves. With the schedulingDecision API, we can now implement a common controller that allows those projects to tap into the scheduling capabilities that cluster manager projects (OCM, KubeFleet, Karmada, etc.) provide. I think there is clear value in that alone.

I feel Ryan has hit the bullseye: this represents a common scheduling decision for any project to consume, where they don't necessarily need to be concerned with the implementation. The end result is that these projects, or the end consumer, could pick a scheduler or multi-cluster implementation that suits their needs and still have it work with the consumer. Argo CD ApplicationSets are one example of such a consumer.

Contributor

x-post to comment in API example here: https://github.com/kubernetes/enhancements/pull/5314/files#r2304565590

Regarding this part in the KEP:

Placement: A scheduler request that asks "where should this workload run?". Note: This KEP does not standardize the Placement request format itself, only the PlacementDecision output. Consumers may still need scheduler specific knowledge to fully understand placement intent, though basic workload distribution can be achieved by simply deploying to the clusters listed in decisions.

and these comments from this thread:

While consumers may still need scheduler specific knowledge for complex scenarios

this represent a common scheduling decision for any project to consume, where they don't necessarily need to be concerned with the implementation.

I think the comment from @skitt linked above raises the point that, even in the basic case, a consumer knows nothing more about the Placement object in this KEP. How can it use this API to fully outsource placement, given that it still needs to know about the work it submitted for placement?
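
To make the "basic case" in this thread concrete, here is a minimal Go sketch of a consumer that only reads the list of clusters from a decision and never looks at the Placement that produced it. The `PlacementDecision` and `ClusterDecision` shapes, the `ClusterName` field, and `targetClusters` are hypothetical stand-ins for illustration, not the types defined in the KEP.

```go
package main

import "fmt"

// ClusterDecision and PlacementDecision are hypothetical, trimmed-down shapes
// used only for this sketch; they are not the types defined by the KEP.
type ClusterDecision struct {
	ClusterName string // name of the selected cluster (assumed field)
	Reason      string // optional explanation from the scheduler
}

type PlacementDecision struct {
	Name      string
	Decisions []ClusterDecision
}

// targetClusters returns the distinct cluster names from a decision, in the
// order they first appear. It never inspects the originating Placement.
func targetClusters(pd PlacementDecision) []string {
	seen := make(map[string]bool, len(pd.Decisions))
	var names []string
	for _, d := range pd.Decisions {
		if d.ClusterName == "" || seen[d.ClusterName] {
			continue // skip empty or duplicate entries
		}
		seen[d.ClusterName] = true
		names = append(names, d.ClusterName)
	}
	return names
}

func main() {
	pd := PlacementDecision{
		Name: "example-decision",
		Decisions: []ClusterDecision{
			{ClusterName: "cluster-a"},
			{ClusterName: "cluster-b", Reason: "example only"},
			{ClusterName: "cluster-a"},
		},
	}
	for _, name := range targetClusters(pd) {
		fmt.Println("deploy to", name)
	}
}
```

Note that the sketch only answers "where": the consumer still has to know which workload it submitted for placement, which is exactly the open question in this thread.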

// ClusterDecision references a target ClusterProfile for placement.
type ClusterDecision struct {
	// Reference to the target ClusterProfile.
	ClusterProfileRef ClusterProfileRef `json:"clusterProfileRef"`
Member

ObjectReference?

Contributor Author

Yes, updated to ObjectReference as suggested.
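
For readers following along, a rough sketch of what the struct might look like after this change. The `ObjectReference` type below is a simplified, hypothetical stand-in (its field set, the `reason` JSON name, and the `encode` helper are assumptions for illustration); the KEP text is the source of truth for the actual definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectReference is a simplified, hypothetical stand-in for the reference
// type adopted in the KEP; only fields needed for illustration are included.
type ObjectReference struct {
	APIGroup  string `json:"apiGroup,omitempty"`
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

// ClusterDecision references a target ClusterProfile for placement.
type ClusterDecision struct {
	// Reference to the target ClusterProfile.
	ClusterProfileRef ObjectReference `json:"clusterProfileRef"`
	// Reason is optional context from the scheduler (assumed field name).
	Reason string `json:"reason,omitempty"`
}

// encode serializes a decision to JSON so the wire shape can be inspected.
func encode(d ClusterDecision) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	d := ClusterDecision{
		ClusterProfileRef: ObjectReference{
			APIGroup: "multicluster.x-k8s.io", // assumed group for ClusterProfile
			Kind:     "ClusterProfile",
			Name:     "cluster-a",
		},
		Reason: "example only",
	}
	out, err := encode(d)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```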

The scheduler may choose to populate the reason for each decision for consumers/end-users
(e.g., for debugging purposes).

- **Update / Reschedule**: The scheduler may add or remove clusters in decisions at any time.
Member

What action is the consumer expected to take on change?

Contributor Author

Added actions and examples on change.
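
As an illustration of one possible consumer reaction to an update, the following Go sketch diffs the previously observed cluster list against the current one to work out where to deploy and where to clean up. The function name `diffClusters` and the use of plain cluster-name strings are assumptions for this sketch; the actions the KEP actually prescribes are described in the KEP text.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet converts a slice of cluster names into a set.
func toSet(names []string) map[string]bool {
	set := make(map[string]bool, len(names))
	for _, n := range names {
		set[n] = true
	}
	return set
}

// diffClusters compares the previously observed cluster names with the
// current ones and returns the clusters that were added and removed,
// each sorted for deterministic output.
func diffClusters(previous, current []string) (added, removed []string) {
	prev, curr := toSet(previous), toSet(current)
	for name := range curr {
		if !prev[name] {
			added = append(added, name)
		}
	}
	for name := range prev {
		if !curr[name] {
			removed = append(removed, name)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	added, removed := diffClusters(
		[]string{"cluster-a", "cluster-b"},
		[]string{"cluster-b", "cluster-c"},
	)
	fmt.Println("deploy to:", added)
	fmt.Println("clean up:", removed)
}
```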

When the cluster set itself has not changed, this stable ordering produces an identical set of clusters,
so the API server skips the write and no extra change events reach consumers.
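
A minimal sketch of the idea behind this stable ordering, assuming clusters are identified by name and sorted lexicographically: if the normalized desired list equals the normalized stored list, nothing has changed and no update is needed. The helpers `normalize` and `needsWrite` are hypothetical and only illustrate the comparison, not the scheduler's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// normalize returns a sorted copy so that the same set of clusters always
// appears in the same order, regardless of how the scheduler produced it.
func normalize(clusters []string) []string {
	out := append([]string(nil), clusters...)
	sort.Strings(out)
	return out
}

// needsWrite reports whether the desired cluster list differs from the
// stored one once both are put into the stable order.
func needsWrite(stored, desired []string) bool {
	a, b := normalize(stored), normalize(desired)
	if len(a) != len(b) {
		return true
	}
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	stored := []string{"cluster-a", "cluster-b"}
	// Same set in a different order: no write, so no change event.
	fmt.Println(needsWrite(stored, []string{"cluster-b", "cluster-a"}))
	// A cluster was added: the decision must be updated.
	fmt.Println(needsWrite(stored, []string{"cluster-a", "cluster-b", "cluster-c"}))
}
```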

- **Delete**: When a placement is no longer required,
Member

What does it mean that a placement is no longer required? What change triggers this?

Contributor Author

Updated explanation of what it means to be no longer required and examples of toggles.

Signed-off-by: Mike Ng <ming@redhat.com>
@mikeshng mikeshng force-pushed the placement-decision-api branch from 272a8e6 to a9ba48c Compare August 11, 2025 18:30
@lauralorenz
Contributor

Triage notes:

  • Conversation right now is about the original position of having a placement decision API without a placement request API.
  • For now added it as a non-goal -- is that enough?
  • If not: at least looking for more motivation on the limits and scope of the placement decision API by itself without the placement request part
  • Want to determine if this KEP can continue forward with the "one half" of the situation (response part without request)

Comment on lines +231 to +232
apiGroup: multicluster.x-k8s.io
kind: Placement
Member

Can this example be reworked to avoid relying on an object that hasn’t been defined yet, as far as I’m aware? If this KEP were approved, what would implementations look like? Are there common characteristics that an object referenced here would have?

Member

How would a non-scheduler-specific or aware consumer use this without knowledge of the Placement object?

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.