From 48869a61a3ae4ee901b37ca1682b56be2ae3c0be Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Thu, 9 Jan 2025 11:02:34 -0500 Subject: [PATCH 01/13] Add a proposal for integrating volume limits into cluster autoscaler --- .../5030-attach-limit-autoscaler/README.md | 807 ++++++++++++++++++ .../5030-attach-limit-autoscaler/kep.yaml | 50 ++ 2 files changed, 857 insertions(+) create mode 100644 keps/sig-storage/5030-attach-limit-autoscaler/README.md create mode 100644 keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/README.md b/keps/sig-storage/5030-attach-limit-autoscaler/README.md new file mode 100644 index 00000000000..7018679e98f --- /dev/null +++ b/keps/sig-storage/5030-attach-limit-autoscaler/README.md @@ -0,0 +1,807 @@ + +# KEP-NNNN: Your short, descriptive title + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - 
[Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: 
https://git.k8s.io/website + +## Summary + + + +## Motivation + + + +### Goals + + + +### Non-Goals + + + +## Proposal + + + +### User Stories (Optional) + + + +#### Story 1 + +#### Story 2 + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+ + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? 
+ +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml new file mode 100644 index 00000000000..49fbc9d9d02 --- /dev/null +++ b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml @@ -0,0 +1,50 @@ +title: Integrate CSI Volume attach limits with cluster autoscaler +kep-number: 5030 +authors: + - "@gnufied" +owning-sig: sig-storage +participating-sigs: + - sig-storage + - sig-scheduling + - sig-autoscaling +status: provisional +creation-date: 2025-01-09 +reviewers: + - TBD + - "@jsafrane" + - "@msau42" +approvers: + - TBD + +see-also: + - "/keps/sig-aaa/1234-we-heard-you-like-keps" + - "/keps/sig-bbb/2345-everyone-gets-a-kep" +replaces: + - "/keps/sig-ccc/3456-replaced-kep" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha|beta|stable + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: + alpha: "v1.33" + beta: "v1.35" + stable: "v1.37" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: MyFeature + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - my_feature_metric From 0c63a15bc7b3f5ed122a3c575565a032ec26b5e5 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Tue, 10 Jun 2025 14:25:53 -0400 Subject: [PATCH 02/13] Add PRR sections for code --- .../5030-attach-limit-autoscaler/README.md | 310 +++++++----------- 1 file changed, 110 insertions(+), 200 deletions(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/README.md b/keps/sig-storage/5030-attach-limit-autoscaler/README.md index 7018679e98f..c93fa94cda6 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/README.md +++ b/keps/sig-storage/5030-attach-limit-autoscaler/README.md @@ -1,64 +1,4 @@ - -# KEP-NNNN: Your short, descriptive title +# KEP-5030: Integrate Volume Attach limit into cluster autoscaler +Currently, cluster-autoscaler doesn't take into account the volume-attach limit that a node may have when scaling nodes to support unschedulable pods. -## Motivation +This leads to a bunch of problems: +- If there are unschedulable pods that require more volumes than the newly created nodes support, there will still be unschedulable pods left. - +Once cluster-autoscaler is aware of CSI volume attach limits, we can fix Kubernetes' built-in scheduler so that it does not schedule pods to nodes that don't have the CSI driver installed. ### Goals - +- Modify cluster-autoscaler so that it is aware of CSI volume limits. +- Fix the scheduler so that it doesn't schedule pods to a node that doesn't have the CSI driver installed. ### Non-Goals - +- Deschedule pods that can't fit on a node because of race conditions. 
## Proposal - +As part of this proposal, we propose changes to both cluster-autoscaler and Kubernetes' built-in scheduler. -### User Stories (Optional) +1. Fix cluster-autoscaler so that it takes attach limits into account when scaling nodes from 0 in a nodegroup. +2. Fix cluster-autoscaler so that it takes attach limits into account when scaling nodegroups with existing nodes. +3. Fix the Kubernetes built-in scheduler so that it does not schedule pods to nodes that don't have the CSI driver installed. - + +### User Stories (Optional) #### Story 1 +- A user has more than one pending pod because no existing node has any attach limit left. +- Cluster autoscaler evaluates existing nodegroups. +- It picks a nodegroup based on its existing criteria and accurately determines the number of nodes it needs to spin up, based on the volumes that the pending pods require. #### Story 2 +- A Kubernetes admin has one or more nodes where the CSI driver is not installed. +- Without explicitly tainting the nodes or using node affinity in workloads, nodes which don't have the CSI driver installed aren't used for scheduling pods that require volumes. ### Notes/Constraints/Caveats (Optional) - +Scheduler changes can only be merged after the cluster-autoscaler changes have been GAed and there are no concerns about the scheduler changes. ### Risks and Mitigations @@ -247,12 +162,70 @@ Consider including folks who also work outside the SIG or subproject. ## Design Details - +## Cluster Autoscaler changes + +We can split the implementation in cluster-autoscaler into two parts: +- Scaling a node-group that already has one or more nodes. +- Scaling a node-group that doesn't have any nodes (scaling from zero). + +### Scaling a node-group that already has one or more nodes + +1. We propose adding a label, similar to GPULabel, to nodes that are supposed to come up with a CSI driver. 
This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there. + +However, we also propose that a node will be considered ready as soon as the corresponding CSI driver is reported as installed via the corresponding CSINode object. + +A node which is ready but does not have the CSI driver installed within a certain time limit will be considered NotReady and removed from the cluster. + + +2. We propose that we add volume limits and installed CSI driver information to framework.NodeInfo objects: + +``` +type NodeInfo struct { +.... +.... +csiDrivers map[string]*DriverAllocatable +.. +} + +type DriverAllocatable struct { + Allocatable int32 +} +``` + +3. We propose that, when saving `ClusterState`, we capture and add `csiDrivers` information to all existing nodes. + +4. We propose that, when getting nodeInfosForGroups, the returned nodeInfo map also contains CSI driver information, which can be used later on for scheduling decisions. + +``` +nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.taintConfig, currentTime) +``` + +Please note that we will have to handle the case of scaling from 0 separately from +scaling from 1, because in the former case no CSI volume limit information will be available +if no node exists in a NodeGroup. + +5. We propose that, when deciding which pods should be considered for scaling nodes in the podListProcessor.Process function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on the recently spun-up or currently running nodes. + +Making the aforementioned changes should allow us to handle scaling of nodes from 1. 
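The per-driver fit check implied by the proposed `csiDrivers` map can be sketched as follows. This is an illustrative, self-contained sketch, not autoscaler code: `NodeInfo` here is a simplified stand-in for the autoscaler's `framework.NodeInfo`, and the `podFits` helper and its signature are hypothetical.

```go
package main

import "fmt"

// DriverAllocatable mirrors the proposed field: the remaining
// attachable-volume capacity a node reports via its CSINode object.
type DriverAllocatable struct {
	Allocatable int32
}

// NodeInfo is a simplified stand-in for the autoscaler's
// framework.NodeInfo, extended with the proposed csiDrivers map.
type NodeInfo struct {
	Name       string
	CSIDrivers map[string]*DriverAllocatable
}

// podFits reports whether a pod that needs the given number of volumes per
// CSI driver can still be placed on the node. A driver missing from the map
// means the driver is not installed, so the pod cannot fit.
func podFits(node *NodeInfo, volumesByDriver map[string]int32) bool {
	for driver, needed := range volumesByDriver {
		alloc, ok := node.CSIDrivers[driver]
		if !ok || alloc.Allocatable < needed {
			return false
		}
	}
	return true
}

func main() {
	node := &NodeInfo{
		Name: "node-1",
		CSIDrivers: map[string]*DriverAllocatable{
			"ebs.csi.aws.com": {Allocatable: 2},
		},
	}
	fmt.Println(podFits(node, map[string]int32{"ebs.csi.aws.com": 1}))      // within the limit
	fmt.Println(podFits(node, map[string]int32{"ebs.csi.aws.com": 3}))      // over the limit
	fmt.Println(podFits(node, map[string]int32{"pd.csi.storage.gke.io": 1})) // driver not installed
}
```

A check of this shape would run in the scale-up simulation for every candidate node, decrementing `Allocatable` as pods are tentatively placed.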
+ +### Scaling from zero + +Scaling from zero should work similarly to scaling from 1, but the main problem is that we do not have a NodeInfo which can tell us what the CSI attach limit would be on the node being spun up in a NodeGroup. + +We propose that we introduce an annotation, similar to the CPU and memory resource annotations in cluster-api, to process the attach limits available on a node. + +We have to introduce a similar mechanism in the various cloud providers which return Template objects, to incorporate volume limits. This will allow us to handle the case of scaling from zero. + + +## Kubernetes Scheduler change + +We also propose that, if a given node is not reporting any installed CSI drivers, we do not schedule pods that need CSI volumes to that node. + +The proposed change is small and a draft PR is available here - https://github.com/kubernetes/kubernetes/pull/130702 + +This will stop too many pods from crowding a node when a new node is spun up and is not yet reporting volume limits. + +But this alone is not enough to fix the underlying problem. Cluster-autoscaler must be fixed so that it is aware of a node's attach limits via the CSINode object. ### Test Plan @@ -299,75 +272,26 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement. --> -- ``: `` - `` +- k8s.io/autoscaler/cluster-autoscaler/core: 06/10/2025 - 77.3% ##### Integration tests - - - - -- : +None ##### e2e tests - - -- : +We will add tests that validate both the scaling-from-0 and scaling-from-1 use cases. 
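The scheduler-side rule proposed above (skip nodes whose CSINode does not report a driver the pod needs) can be sketched as a filter. This is an illustrative sketch only: the real change lives in kube-scheduler's CSI limits plugin, and the plain maps and the `filterNodes` helper here are hypothetical simplifications.

```go
package main

import "fmt"

// nodeHasDriver reports whether a node's set of installed CSI drivers (as
// reported by its CSINode object) contains the given driver. An empty set
// models a new node whose CSINode is not reporting any driver yet.
func nodeHasDriver(installed map[string]bool, driver string) bool {
	return installed[driver]
}

// filterNodes returns only the nodes eligible for a pod that needs a volume
// from the given CSI driver. Pods that need no CSI volumes (driver == "")
// pass every node unchanged.
func filterNodes(nodes map[string]map[string]bool, driver string) []string {
	eligible := []string{}
	for name, installed := range nodes {
		if driver == "" || nodeHasDriver(installed, driver) {
			eligible = append(eligible, name)
		}
	}
	return eligible
}

func main() {
	nodes := map[string]map[string]bool{
		"node-a": {"ebs.csi.aws.com": true},
		"node-b": {}, // new node, CSINode not yet reporting any driver
	}
	fmt.Println(filterNodes(nodes, "ebs.csi.aws.com")) // only node-a is eligible
}
```

With this rule, pods needing CSI volumes stay pending until a node's CSINode reports the driver, instead of crowding a freshly booted node that has no attach-limit information yet.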
### Graduation Criteria - -- [ ] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: - - Components depending on the feature gate: +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `VolumeLimitScaling` + - Components depending on the feature gate: `autoscaler`, `scheduler` - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control @@ -481,13 +405,12 @@ well as the [existing list] of feature gates. ###### Does enabling the feature change any default behavior? - +Yes, it will cause cluster-autoscaler to consider volume limits when scaling nodes. It will also cause the scheduler to not schedule pods to nodes that don't have the CSI driver installed. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes. This will simply restore the old behaviour. + +Depends on cluster-autoscaler running in the cluster. ### Scalability @@ -792,11 +704,9 @@ Why should this KEP _not_ be implemented? ## Alternatives - + +Certain Kubernetes vendors taint newly created nodes, and the CSI driver has logic to remove the taint when it starts on the node. 
+- https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/2309 ## Infrastructure Needed (Optional) From ec94fb9646886467a572ed2416418e56ee174973 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Tue, 10 Jun 2025 17:02:16 -0400 Subject: [PATCH 03/13] Add kep approver --- keps/prod-readiness/sig-storage/5030.yaml | 3 +++ 1 file changed, 3 insertions(+) create mode 100644 keps/prod-readiness/sig-storage/5030.yaml diff --git a/keps/prod-readiness/sig-storage/5030.yaml b/keps/prod-readiness/sig-storage/5030.yaml new file mode 100644 index 00000000000..4ccffe99a53 --- /dev/null +++ b/keps/prod-readiness/sig-storage/5030.yaml @@ -0,0 +1,3 @@ +kep-number: 5030 +alpha: + approver: "@deads2k" From 99f9aa742150d2b820375b9c29b7d81fcd24603d Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Wed, 11 Jun 2025 11:25:28 -0400 Subject: [PATCH 04/13] Update keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml Co-authored-by: Drew Hagen --- keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml index 49fbc9d9d02..3975f9392fe 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml +++ b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml @@ -7,7 +7,7 @@ participating-sigs: - sig-storage - sig-scheduling - sig-autoscaling -status: provisional +status: implementable creation-date: 2025-01-09 reviewers: - TBD From f09321363703d8c4e1b34ad7e214b28246977eec Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Wed, 11 Jun 2025 11:25:35 -0400 Subject: [PATCH 05/13] Update keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml Co-authored-by: Drew Hagen --- keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml index 
3975f9392fe..94a9c3edd07 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml +++ b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml @@ -28,7 +28,7 @@ stage: alpha|beta|stable # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.33" +latest-milestone: "v1.34" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: From f9c3c6ef08546dc4950f749a4afa354b44252409 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Wed, 11 Jun 2025 11:35:25 -0400 Subject: [PATCH 06/13] Update kep.yaml with latest milestone and remove unneeded sections --- .../5030-attach-limit-autoscaler/kep.yaml | 20 +++++-------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml index 94a9c3edd07..332e3ea3771 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml +++ b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml @@ -16,14 +16,8 @@ reviewers: approvers: - TBD -see-also: - - "/keps/sig-aaa/1234-we-heard-you-like-keps" - - "/keps/sig-bbb/2345-everyone-gets-a-kep" -replaces: - - "/keps/sig-ccc/3456-replaced-kep" - # The target maturity stage in the current dev cycle for this KEP. -stage: alpha|beta|stable +stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively @@ -32,19 +26,15 @@ latest-milestone: "v1.34" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: - alpha: "v1.33" + alpha: "v1.34" beta: "v1.35" stable: "v1.37" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled feature-gates: - - name: MyFeature + - name: VolumeLimitScaling components: - - kube-apiserver - - kube-controller-manager + - scheduler + - autoscaler disable-supported: true - -# The following PRR answers are required at beta release -metrics: - - my_feature_metric From fc012c8a9768f025a6d701b8f898b600d9785913 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Wed, 11 Jun 2025 12:01:19 -0400 Subject: [PATCH 07/13] Add approvers and reviewers from sig-autoscaling too --- keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml index 332e3ea3771..c20e790e438 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml +++ b/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml @@ -10,11 +10,16 @@ participating-sigs: status: implementable creation-date: 2025-01-09 reviewers: - - TBD + - "@gjtempleton" - "@jsafrane" - "@msau42" + - "@elmiko" + - "@towca" approvers: - - TBD + - "@gjtempleton" + - "@jsafrane" + - "@msau42" + - "@towca" # The target maturity stage in the current dev cycle for this KEP. 
stage: alpha From 33da4d98df391783b9d3c2959e728bc410640ff4 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Wed, 11 Jun 2025 16:35:18 -0400 Subject: [PATCH 08/13] Update keps/sig-storage/5030-attach-limit-autoscaler/README.md Co-authored-by: Kevin Hannon --- keps/sig-storage/5030-attach-limit-autoscaler/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/README.md b/keps/sig-storage/5030-attach-limit-autoscaler/README.md index c93fa94cda6..a0a80da9046 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/README.md +++ b/keps/sig-storage/5030-attach-limit-autoscaler/README.md @@ -395,7 +395,7 @@ well as the [existing list] of feature gates. - [x] Feature gate (also fill in values in `kep.yaml`) - Feature gate name: `VolumeLimitScaling` - - Components depending on the feature gate: `autoscaler`, `scheduler` + - Components depending on the feature gate: `autoscaler`, `kube-scheduler` - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control From 0022f89d6e871bea3397d3957146b4433f028db9 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Thu, 12 Jun 2025 11:51:27 -0400 Subject: [PATCH 09/13] Change wording about not scheduling pods that require CSI volumes --- keps/sig-storage/5030-attach-limit-autoscaler/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/README.md b/keps/sig-storage/5030-attach-limit-autoscaler/README.md index a0a80da9046..7f29cae2bc5 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/README.md +++ b/keps/sig-storage/5030-attach-limit-autoscaler/README.md @@ -73,9 +73,9 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) - [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented +- [x] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free @@ -111,12 +111,12 @@ This leads to a bunch of problems: - Since a node does not typically come up with a CSI driver, too many pods usually get scheduled on a node that may not be able to support them in the first place. This leaves a bunch of pods stuck. -Once cluster-autoscaler is aware of CSI volume attach limits, we can fix Kubernetes' built-in scheduler so that it does not schedule pods to nodes that don't have the CSI driver installed. +Once cluster-autoscaler is aware of CSI volume attach limits, we can fix Kubernetes' built-in scheduler so that it does not schedule pods to nodes that don't have the CSI driver installed, if the pods require the given CSI volumes. ### Goals - Modify cluster-autoscaler so that it is aware of CSI volume limits. -- Fix the scheduler so that it doesn't schedule pods to a node that doesn't have the CSI driver installed. +- Fix the scheduler so that it doesn't schedule pods that require a given CSI volume to a node that doesn't have the CSI driver installed. 
### Non-Goals From e71d1d6b56c6bb9c22e99ecba7786dc1afc01d79 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Thu, 12 Jun 2025 14:31:23 -0400 Subject: [PATCH 10/13] Add risks and mitigations and phases --- .../5030-attach-limit-autoscaler/README.md | 23 +++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/README.md b/keps/sig-storage/5030-attach-limit-autoscaler/README.md index 7f29cae2bc5..b1eceffdd3f 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/README.md +++ b/keps/sig-storage/5030-attach-limit-autoscaler/README.md @@ -40,6 +40,8 @@ tags, and then generate with `hack/update-toc.sh`. - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) + - [Phase-1](#phase-1) + - [Phase-2](#phase-2) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) @@ -148,6 +150,9 @@ Scheduler changes can only be merged after cluster-autoscaler changes has been G ### Risks and Mitigations +As mentioned above, there is a risk associated with scheduler changes being merged before cluster-autoscaler changes can be merged, but we want to mitigate +this by making sure cluster-autoscaler changes are merged first and has N+2 window before scheduler changes can be made GA. + - - - - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) @@ -132,6 +116,8 @@ As part of this proposal we are proposing changes into both cluster-autoscaler a 2. Fix cluster-autoscaler so as it takes into account attach limits when scaling nodegroups with existing nodes. 3. Fix kubernetes built-in scheduler so as we do not schedule pods to nodes that doesn't have CSI driver installed. +Since cluster-autoscaler changes need to happen first, `kube-schduler` changes are out-of-scope for alpha implementation in v1.34. 
+ ### User Stories (Optional) diff --git a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml b/keps/sig-autoscaling/5030-attach-limit-autoscaler/kep.yaml similarity index 97% rename from keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml rename to keps/sig-autoscaling/5030-attach-limit-autoscaler/kep.yaml index c20e790e438..af929b855b1 100644 --- a/keps/sig-storage/5030-attach-limit-autoscaler/kep.yaml +++ b/keps/sig-autoscaling/5030-attach-limit-autoscaler/kep.yaml @@ -2,7 +2,7 @@ title: Integrate CSI Volume attach limits with cluster autoscaler kep-number: 5030 authors: - "@gnufied" -owning-sig: sig-storage +owning-sig: sig-autoscaling participating-sigs: - sig-storage - sig-scheduling From fd4bef152126d514ac046f2b0ac83b9181ec8d63 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Mon, 16 Jun 2025 13:49:16 -0400 Subject: [PATCH 12/13] Move prr to sig-autoscaling --- keps/prod-readiness/{sig-storage => sig-autoscaling}/5030.yaml | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename keps/prod-readiness/{sig-storage => sig-autoscaling}/5030.yaml (100%) diff --git a/keps/prod-readiness/sig-storage/5030.yaml b/keps/prod-readiness/sig-autoscaling/5030.yaml similarity index 100% rename from keps/prod-readiness/sig-storage/5030.yaml rename to keps/prod-readiness/sig-autoscaling/5030.yaml From 7c7e56c9e87e9bc1cff94a800c18b0c90a6acfb9 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Fri, 20 Jun 2025 12:55:55 -0400 Subject: [PATCH 13/13] Explain k8s-scheduler and autoscaler coupling --- .../5030-attach-limit-autoscaler/README.md | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/keps/sig-autoscaling/5030-attach-limit-autoscaler/README.md b/keps/sig-autoscaling/5030-attach-limit-autoscaler/README.md index b8f8f2aafb8..a82dcfc482c 100644 --- a/keps/sig-autoscaling/5030-attach-limit-autoscaler/README.md +++ b/keps/sig-autoscaling/5030-attach-limit-autoscaler/README.md @@ -139,6 +139,13 @@ Scheduler changes can only be merged after 
cluster-autoscaler changes has been G As mentioned above, there is a risk associated with scheduler changes being merged before cluster-autoscaler changes can be merged, but we want to mitigate this by making sure cluster-autoscaler changes are merged first and has N+2 window before scheduler changes can be made GA. + +Please note that the risk exists in terms of the `kube-scheduler` code that is vendored into cluster-autoscaler. The risk doesn't arise from independently running `kube-scheduler` +and cluster-autoscaler. So we have to be careful not to vendor a version of `kube-scheduler` in cluster-autoscaler that prevents scheduling of pods (that require CSI volumes) +to nodes that don't have the CSI driver until `cluster-autoscaler` can properly take into account the CSI node attach limits available on an upcoming node. + +
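For the scale-from-zero part of the design, the proposed annotation mechanism (modeled on the cluster-api CPU/memory capacity annotations) could be parsed along these lines. This is a hedged sketch only: the annotation key, its value format, and the `parseAttachLimits` helper are all hypothetical, since the KEP does not fix a concrete format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAttachLimits parses a hypothetical scale-from-zero annotation value
// into per-driver attach limits, usable for building a template NodeInfo
// when a NodeGroup has no real nodes yet.
// Example value: "ebs.csi.aws.com=25,pd.csi.storage.gke.io=15"
func parseAttachLimits(annotation string) (map[string]int32, error) {
	limits := map[string]int32{}
	if annotation == "" {
		return limits, nil
	}
	for _, pair := range strings.Split(annotation, ",") {
		kv := strings.SplitN(strings.TrimSpace(pair), "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry %q", pair)
		}
		n, err := strconv.ParseInt(kv[1], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("bad limit in %q: %v", pair, err)
		}
		limits[kv[0]] = int32(n)
	}
	return limits, nil
}

func main() {
	limits, err := parseAttachLimits("ebs.csi.aws.com=25")
	fmt.Println(limits, err)
}
```

Each cloud provider's `TemplateNodeInfo` path would then merge these parsed limits into the template node's `csiDrivers` map, mirroring what CSINode reports on real nodes.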