Skip to content

Conversation

@alimaazamat
Copy link
Contributor

@alimaazamat alimaazamat commented Oct 30, 2025

Why are these changes needed?

V2 Autoscaling allows for configuring idleTimeoutSeconds per worker group, which needs added unit + e2e testing
EDIT: e2e testing already done in #2725 issue should've been closed

Related issue number

#2561

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@marosset marosset self-requested a review October 30, 2025 21:52
@alimaazamat alimaazamat changed the title test: add unit and e2e tests for idleTimeoutSeconds config per worker… test: add unit tests for idleTimeoutSeconds config per worker… Oct 30, 2025
@alimaazamat alimaazamat changed the title test: add unit tests for idleTimeoutSeconds config per worker… test: unit tests for idleTimeoutSeconds config per worker groups Oct 30, 2025
@marosset
Copy link

LGTM

@andrewsykim
Copy link
Member

@ryanaoleary can you review since you added the feature?

@alimaazamat alimaazamat force-pushed the idleTimeoutSecondsTests branch 2 times, most recently from a4bc057 to ae88086 Compare November 13, 2025 19:45
Copy link
Collaborator

@ryanaoleary ryanaoleary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I tested some of these cases manually and it looks correct.

@andrewsykim
Copy link
Member

andrewsykim commented Nov 13, 2025

@alimaazamat the e2e buildkite failure seems related? Can you take a look https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/11598#019a7efb-eeee-48fd-8614-6bd5d5311f4e

Triggered a retry incase it's a flake

@alimaazamat
Copy link
Contributor Author

@alimaazamat the e2e buildkite failure seems related? Can you take a look https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/11598#019a7efb-eeee-48fd-8614-6bd5d5311f4e

@andrewsykim Yeah I reran it already previously and it did fail again. Because this are just unit tests I thought it wouldn't impact e2e?

@alimaazamat
Copy link
Contributor Author

Actually I think I found the issue:
Seems like part2 autoscaler tests are using env var only and not the spec version.

@alimaazamat
Copy link
Contributor Author

alimaazamat commented Nov 14, 2025

@andrewsykim
In buildkite the only errors I'm seeing are reconcillation errors where worker pod status is failed during teardown. https://github.com/alimaazamat/kuberay/blob/idleTimeoutSecondsTests/ray-operator/controllers/ray/raycluster_controller.go#L173 is the only time we call our ValidateRayClusterSpec changes and if it is invalid, the controller would return immediately and wouldn't even hit the errors? Let me know if my thinking is on the wrong track.

Comment on lines 45 to 46
WithAutoscalerOptions(rayv1ac.AutoscalerOptions().
WithVersion(rayv1.AutoscalerVersionV2)).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WithAutoscalerOptions(rayv1ac.AutoscalerOptions().
WithVersion(rayv1.AutoscalerVersionV2)).

Are these required? The tests[1] used here already has autoscaler v2 enabled.

Copy link
Contributor Author

@alimaazamat alimaazamat Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even thought tests[1] has v2 enabled as an env variable, but doesn't actually change the spec without

WithAutoscalerOptions(rayv1ac.AutoscalerOptions().
		WithVersion(rayv1.AutoscalerVersionV2)).

I wanted it updated in the spec as well for the validation tests I wrote, since in validation.go the other tests reference the spec not env vars e.g. workerGroup.Template.Spec.RestartPolicy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also get
The RayCluster spec is invalid test-ns-k5rcx/ray-cluster: worker group short-idle-timeout-group has idleTimeoutSeconds set, but autoscaler version is not v2. Please set .spec.autoscalerOptions.version to v2 |

@alimaazamat alimaazamat changed the title test: unit tests for idleTimeoutSeconds config per worker groups test: validation tests for idleTimeoutSeconds config per worker groups Dec 3, 2025
@alimaazamat alimaazamat changed the title test: validation tests for idleTimeoutSeconds config per worker groups test: idleTimeoutSeconds config per worker groups validation for RayClusters Dec 3, 2025
@alimaazamat alimaazamat changed the title test: idleTimeoutSeconds config per worker groups validation for RayClusters Add validation to require RayCluster v2 when using idleTimeoutSeconds Dec 3, 2025
@alimaazamat alimaazamat force-pushed the idleTimeoutSecondsTests branch 2 times, most recently from 4475ce5 to 3fada64 Compare December 4, 2025 00:24
Copy link

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link

@marosset marosset left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alimaazamat alimaazamat force-pushed the idleTimeoutSecondsTests branch from c9677d9 to 7be6c82 Compare December 4, 2025 22:15
Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
@alimaazamat alimaazamat force-pushed the idleTimeoutSecondsTests branch from 7be6c82 to a608d94 Compare December 4, 2025 22:17
@alimaazamat
Copy link
Contributor Author

alimaazamat commented Dec 4, 2025

@andrewsykim @ryanaoleary @rueian Changed validation to use env var for autoscaler v2 but according to this: #3578 and this comment #3578 (comment) I think we prefer using spec version not the env var. I am opening up a separate PR to change tests[1] to use the spec version instead of the env var.
EDIT: changed behavior to prefer spec and fall back to env var depending on Kuberay version based on documentation https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html

@alimaazamat alimaazamat changed the title Add validation to require RayCluster v2 when using idleTimeoutSeconds [Autoscaler] Add validation to require RayCluster v2 when using idleTimeoutSeconds Dec 5, 2025
Copy link
Collaborator

@win5923 win5923 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, validating env var looks better for me. This also supports users of KubeRay versions < 1.4.0, which do not allow using the spec version. Thank you!

@win5923 win5923 self-requested a review December 5, 2025 06:33
@alimaazamat alimaazamat force-pushed the idleTimeoutSecondsTests branch from ded1e68 to db7a2c6 Compare December 5, 2025 18:45
Co-authored-by: Jun-Hao Wan <ken89@kimo.com>
Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com>
Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
@alimaazamat alimaazamat force-pushed the idleTimeoutSecondsTests branch from db7a2c6 to d76a861 Compare December 5, 2025 19:53
@alimaazamat
Copy link
Contributor Author

alimaazamat commented Dec 5, 2025

First check if spec version is set with IsAutoscalingV2Enabled, if not then fall back to using env var. This works with these instructions in docs: https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html
If you’re using KubeRay >= 1.4.0, enable V2 by setting RayCluster.spec.autoscalerOptions.version: v2.
If you’re using KubeRay < 1.4.0, enable V2 by setting the RAY_enable_autoscaler_v2 environment variable in the head and using restartPolicy: Never on head and all worker groups.
Updated tests to reflect this validation behavior as well.

@alimaazamat alimaazamat requested review from rueian and win5923 December 5, 2025 22:09
Copy link
Collaborator

@win5923 win5923 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! cc @rueian for final review.

Comment on lines +107 to +109
if err := validateWorkerGroupIdleTimeout(workerGroup, spec); err != nil {
return err
}
Copy link
Member

@Future-Outlier Future-Outlier Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can we put the logic to here?

    if IsAutoscalingV2Enabled(spec) {
    if spec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && spec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever {
    return fmt.Errorf("restartPolicy for head Pod should be Never or unset when using autoscaler V2")
    }
    for _, workerGroup := range spec.WorkerGroupSpecs {
    if workerGroup.Template.Spec.RestartPolicy != "" && workerGroup.Template.Spec.RestartPolicy != corev1.RestartPolicyNever {
    return fmt.Errorf("restartPolicy for worker group %s should be Never or unset when using autoscaler V2", workerGroup.GroupName)
    }
    }
    }
    }

  2. Can we also add validation logic for AutoscalerOptions.IdleTimeoutSeconds?

    // IdleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    // Defaults to 60 (one minute). It is not read by the KubeRay operator but by the Ray autoscaler.
    // +optional
    IdleTimeoutSeconds *int32 `json:"idleTimeoutSeconds,omitempty"`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Future-Outlier Why would we put the validation logic in validateWorkerGroupIdleTimeout? Above it we have other validation functions like validateRayGroupLabels and validateRayGroupResources that are structured the same way: where we define those functions seperate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For validation logic for AutoscalerOptions.IdleTimeoutSeconds I agree, let's do that in a seperate PR. This one is scoped out for worker groups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants