[Feature] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy #4185
Conversation
Hi @andrewsykim, I followed your previous comments and added a spec.upgradeStrategy API to RayCluster. However, I'm concerned this approach may introduce some issues:
Maybe we can just add a feature gate instead of using a spec.upgradeStrategy.type field in RayCluster to enable the recreate behavior. WDYT?
Feature gates are used to gate features that are in early development and not ready for wider adoption; they shouldn't be used to change the behavior of RayCluster, because a feature gate will eventually be on by default (and forced on).
I think both of those concerns are valid, but I don't think this is a problem with separation of concerns, as RayCluster is a building block for both RayService and RayJob. For the cases you mentioned, we should have validation to ensure the RayCluster upgrade strategy cannot be set when used with RayJob and RayService.
Hi @andrewsykim, PTAL, thanks.
Future-Outlier left a comment:
need some review help, cc @seanlaii @JiangJiaWei1103 @CheyuWu
Something to be mindful of with this strategy is that adding new fields or slightly changing some property of the template (e.g. changing the default entrypoint) could trigger an unintended recreate when a user upgrades KubeRay. I think we ran into something similar when dealing with upgrade behavior in RayService, see #2320 for reference.
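The risk described above can be sketched in a few lines: if the recreate decision hashes the whole serialized template, then a KubeRay upgrade that merely introduces a new defaulted field changes the hash and forces a pod recreate even though the user changed nothing. This is a minimal illustration with hypothetical types and an FNV hash; it is not KubeRay's actual hashing code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// oldTemplate mimics a pod template as serialized by an older operator version.
type oldTemplate struct {
	Image string `json:"image"`
}

// newTemplate mimics the same template after a hypothetical KubeRay upgrade
// that added a new, defaulted field to the template schema.
type newTemplate struct {
	Image      string `json:"image"`
	Entrypoint string `json:"entrypoint"` // newly added, defaulted field
}

// hashJSON returns an FNV-1a hash of the JSON serialization of v.
func hashJSON(v any) uint32 {
	b, _ := json.Marshal(v)
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

func main() {
	before := hashJSON(oldTemplate{Image: "rayproject/ray:2.9.0"})
	after := hashJSON(newTemplate{Image: "rayproject/ray:2.9.0", Entrypoint: "default"})
	// Same user intent, different serialization, different hash: a naive
	// hash comparison would trigger an unintended recreate here.
	fmt.Println(before != after)
}
```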
```diff
  NewCluster RayServiceUpgradeType = "NewCluster"
  // No new cluster will be created while the strategy is set to None
- None RayServiceUpgradeType = "None"
+ RayServiceUpgradeNone RayServiceUpgradeType = "None"
```
also rename two constants above to start with RayService*
Thanks, updated in f3a9b81.
```go
const (
	// During upgrade, Recreate strategy will delete all existing pods before creating new ones
	Recreate RayClusterUpgradeType = "Recreate"
```

nit:

```diff
- Recreate RayClusterUpgradeType = "Recreate"
+ RayClusterUpgradeRecreate RayClusterUpgradeType = "Recreate"
```
Thanks, updated in f3a9b81
Thanks for pointing this out. I'll add the
Why are these changes needed?
Currently, when users update a RayCluster spec (e.g., update the image), they must re-create the cluster, or manually set spec.suspend to true, wait until all Pods are deleted, and then set it back to false. This is not particularly convenient for users deploying with GitOps systems like ArgoCD.
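For reference, the manual suspend/resume workaround described above looks roughly like this. The cluster name is hypothetical; the ray.io/cluster pod label is used to wait for the Pods to disappear.

```shell
# Suspend the cluster so the operator deletes all of its Pods
# (hypothetical cluster name "raycluster-sample").
kubectl patch raycluster raycluster-sample --type merge -p '{"spec":{"suspend":true}}'

# Wait until every Pod belonging to the cluster is gone.
kubectl wait pod -l ray.io/cluster=raycluster-sample --for=delete --timeout=120s

# Resume: the operator recreates the Pods from the updated spec.
kubectl patch raycluster raycluster-sample --type merge -p '{"spec":{"suspend":false}}'
```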
Ref:
Design doc: https://docs.google.com/document/d/1xQLm0-WQWD-FkufxBJYklOJGvVn4RLk0_vPjLD5ax7o/edit?usp=sharing
Changes
Add a spec.upgradeStrategy field to the RayCluster CRD, with two types:
- Recreate: During upgrade, the Recreate strategy will delete all existing pods before creating new ones.
- None: No new pod will be created while the strategy is set to None.
Implementation
Add the hash of HeadGroupSpec.Template to the head pod annotations and the hash of workerGroup.Template to the worker pod annotations during creation, under the ray.io/pod-template-hash key.
I only compare the HeadGroupSpec.Template and workerGroup.Template because these define the pod-related configurations. The RayCluster.Spec contains many dynamic and component-specific settings (e.g., autoscaler configs, rayStartParams).
Example:
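A sketch of how the new field might look on a RayCluster manifest, assuming the field lands as spec.upgradeStrategy.type as proposed in this PR. The cluster name is hypothetical, and unrelated fields (rayStartParams, resources, worker groups) are omitted for brevity; see the design doc for the exact schema.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample   # hypothetical name
spec:
  upgradeStrategy:
    type: Recreate          # or "None" to leave existing pods untouched
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
```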
Related issue number
Closes #2534 #3905
Checks