From 130377537e913920c90cc02eef1650a6669957e7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 17 May 2024 18:28:59 +0800 Subject: [PATCH 01/30] initial version of "StatefulSet Support for Updating Volume Claim Template" --- .../README.md | 952 ++++++++++++++++++ .../kep.yaml | 50 + 2 files changed, 1002 insertions(+) create mode 100644 keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md create mode 100644 keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md new file mode 100644 index 00000000000..6a84f33f5b7 --- /dev/null +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -0,0 +1,952 @@ + +# KEP-NNNN: StatefulSet Support for Updating Volume Claim Template + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Updated Reconciliation Logic](#updated-reconciliation-logic) + - [What PVC is capatible](#what-pvc-is-capatible) + - [Collected PVC Status](#collected-pvc-status) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes) + - [Story 2: Migrating Between Storage Providers](#story-2-migrating-between-storage-providers) + - [Story 3: Migrating Between Different Implementations of the Same Storage Provider](#story-3-migrating-between-different-implementations-of-the-same-storage-provider) + - [Story 4: Shinking the PV by Re-creating PVC](#story-4-shinking-the-pv-by-re-creating-pvc) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [extensively validate the updated volumeClaimTemplate](#extensively-validate-the-updated-volumeclaimtemplate) + - [Only support for updating volumeClaimTemplate.spec.resources.requests.storage](#only-support-for-updating-volumeclaimtemplatespecresourcesrequestsstorage) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Kubernetes does not support the modification of the `volumeClaimTemplate` of a StatefulSet currently. +This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplate`, +automatically updating the associated PersistentVolumeClaim objects in-place if applicable. +Currently, PVC `spec.resources.requests.storage` and `spec.volumeAttributesClassName` +fields can be updated in-place. +For other fields, we support updating existing PersistentVolumeClaim objects with `OnDelete` strategy. +All the updates to PersistentVolumeClaim can be coordinated with `Pod` updates +to honor any dependencies between them. + +## Motivation + + + +Currently there are very few things that users can do to update the volumes of +their existing StatefulSet deployments. +They can only expand the volumes, or modify them with VolumeAttributesClass +by updating individual PersistentVolumeClaim objects as an ad-hoc operation. +When the StatefulSet scales up, the new PVC(s) will be created with the old +config and this again needs manual intervention. +Modifying immutable parameters, shinking, or even switch to another +storage provider is not possible currently. +This brings many headaches in a continuously evolving environment. + +### Goals + + +* Allow users to update the `volumeClaimTemplate` of a `StatefulSet` in place. +* Automatically update the associated PersistentVolumeClaim objects in-place if applicable. +* Support updating PersistentVolumeClaim objects with `OnDelete` strategy. +* Coordinate updates to `Pod` and PersistentVolumeClaim objects. +* Provide accurate status and error messages to users when the update fails. + +### Non-Goals + + +* Support automatic rolling update of PersistentVolumeClaim. +* Validate the updated `volumeClaimTemplate` as how PVC update does. +* Update ephemeral volumes. + + +## Proposal + + +1. 
Change API server to allow any updates to `volumeClaimTemplate` of a StatefulSet. + +2. Modify StatefulSet controller to add PVC reconciliation logic. + +3. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to + specify how to coordinate the update of PVCs and Pods. Possible values are: + - `OnDeleteAsync`: the default value, preserve the current behavior. + - `OnDeleteLockStep`: update PVCs first, then update Pods. See below for details. + +4. Collect the status of managed PVCs, and show them in the StatefulSet status. + +### Updated Reconciliation Logic + +How to update PVCs: +1. If `volumeClaimTemplate` and actual PVC only differ in mutable fields + (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently), + update the PVC in-place to the extent possible. + Do not perform the update that will be rejected by API server, such as + decreasing the storage size below its current status. + Note that decrease the size can help recover from a failed expansion if + `RecoverVolumeExpansionFailure` feature gate is enabled. + +2. If it is not possible to make the PVC [capatible](#what-pvc-is-capatible), + do nothing. But when recreating a Pod and the corresponding PVC is deleting, + wait for the deletion then create a new PVC with the current template + together with the new Pod. + +When to update PVCs: +1. Before recreate the pod, additionally check that the PVC is + [capatible](#what-pvc-is-capatible) with the new `volumeClaimTemplate`. + If not, update the PVC after old Pod deleted, before creating new pod, + or if update is not possible: + - If `volumeClaimUpdateStrategy` is `OnDeleteLockStep`, + wait for the user to delete the old PVC manually before delete the old pod. + - If `volumeClaimUpdateStrategy` is `OnDeleteAsync`, + the diff is ignored and the pod recreation proceeds. + +2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, + The PVCs should be updated just like Pods would. A replica is considered ready + if all its volumes are capatible with the new `volumeClaimTemplate`. + `.spec.ordinals` and `.spec.updateStrategy.rollingUpdate.partition` are also respected. + e.g.: + - If `.spec.updateStrategy.type` is `RollingUpdate`, + update the PVCs in the order from the largest ordinal to the smallest. + Only proceed to the next ordinal when all the PVCs of the previous ordinal + are capatible with the new `volumeClaimTemplate`. + - If `.spec.updateStrategy.type` is `OnDelete`, + Only update the PVC when the Pod is deleted. + + +### What PVC is capatible + +TODO + +### Collected PVC Status + +TODO + +### User Stories (Optional) + + + +#### Story 1: Batch Expand Volumes + +TODO + +#### Story 2: Migrating Between Storage Providers + +TODO + +#### Story 3: Migrating Between Different Implementations of the Same Storage Provider + +TODO + +#### Story 4: Shinking the PV by Re-creating PVC + +TODO + +### Notes/Constraints/Caveats (Optional) + + + +`volumeClaimUpdateStrategy` is introduce to keep capability of current deployed workloads. +StatefulSet currently accepts and uses existing PVCs that is not created by the controller, +So the `volumeClaimTemplate` and PVC can differ even before this enhancement. +Some users may choose to keep the PVCs of different replicas different. +We should not block the Pod updates for them. 
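To make these strategy values concrete, here is a minimal sketch of how the new field could look in the StatefulSet Go API types. The value names follow this proposal; the Go identifiers, comments, and exact placement are illustrative only, not part of the proposal:

```go
// VolumeClaimUpdateStrategy tells the StatefulSet controller how to coordinate
// PVC updates with Pod updates. This is a sketch; identifiers are illustrative.
type VolumeClaimUpdateStrategy string

const (
	// OnDeleteAsyncVolumeClaimUpdateStrategy preserves the current behavior:
	// PVCs that diverge from the template are left alone, and Pod updates
	// are never blocked on PVC state.
	OnDeleteAsyncVolumeClaimUpdateStrategy VolumeClaimUpdateStrategy = "OnDeleteAsync"

	// OnDeleteLockStepVolumeClaimUpdateStrategy updates PVCs first, then
	// updates Pods, keeping the two in lock step.
	OnDeleteLockStepVolumeClaimUpdateStrategy VolumeClaimUpdateStrategy = "OnDeleteLockStep"
)

// StatefulSetSpec gains one new optional field (existing fields elided).
type StatefulSetSpec struct {
	// volumeClaimUpdateStrategy specifies how updates to volumeClaimTemplate
	// are propagated to the PVCs of existing replicas.
	// Defaults to OnDeleteAsync, which preserves today's behavior.
	// +optional
	VolumeClaimUpdateStrategy VolumeClaimUpdateStrategy `json:"volumeClaimUpdateStrategy,omitempty"`
}
```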
+ +If `volumeClaimUpdateStrategy` is `OnDeleteAsync`, +then if the template and PVC differs other than mutable fields, and it is not deleting, +the PVC is not considered as managed by the StatefulSet. + +However, a workload may rely on some features provided by a specific PVC, +So we should provide a way to coordinate the update. +That's why we also need `OnDeleteLockStep`. + +We consider a StatefulSet in stable state if all the managed PVCs are capatible with the current template. +In a stable state, most operations are possible, and we are not actively fixing something. + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: StatefulSetUpdateVolumeClaimTemplate + - Components depending on the feature gate: + - kube-apiserver + - kube-controller-manager +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + +If the PVC capacity is smaller than that in the template, +the PVC will be expanded immediately after the feature is enbled. +This should be rare, the user must have created the PVC before the StatefulSet for this to happen. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? 
+ + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + +### extensively validate the updated `volumeClaimTemplate` + +[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplate`. +e.g., prevent decreasing the storage size, preventing expand if the storage class does not support it. +However, this have saveral drawbacks: +* Not reverting the `volumeClaimTemplate` when rollback the StatefulSet is confusing, +* This can be a barrier when recovering from a failed update. +* The validation is racy, especially when recovering from failed expansion. + We still need to consider most abnormal cases even we do those validations. +* This does not match the pattern of existing behaviors. + That is, the controller should take the expected state, retry as needed to reach that state. + For example, StatefulSet will not reject a invalid `serviceAccountName`. +* `volumeClaimTemplate` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, + a user may still want to affect new PVCs. + +### Only support for updating `volumeClaimTemplate.spec.resources.requests.storage` + +[KEP-0661] only enables expanding the volume. However, because the StatefulSet can take pre-existing PVCs, +we still need to consider what to do when template and PVC don't match. +The complexity of this proposal will not decrease much if we only support expanding the volume. + +By enabling arbitrary updating to the `volumeClaimTemplate`, +we just acknowledge and officially support this use case. 
+ +[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412 + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml b/keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml new file mode 100644 index 00000000000..b922d003c9c --- /dev/null +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/kep.yaml @@ -0,0 +1,50 @@ +title: StatefulSet Support for Updating Volume Claim Template +kep-number: NNNN +authors: + - "@huww98" +owning-sig: sig-storage +participating-sigs: + - sig-app +status: provisional +creation-date: 2024-05-17 +reviewers: + - "@kow3ns" + - "@gnufied" + - "@msau42" + - "@xing-yang" +approvers: + - "@kow3ns" + - "@xing-yang" + +see-also: + - "/keps/sig-storage/1790-recover-resize-failure" + - "/keps/sig-storage/3751-volume-attributes-class" +replaces: + - "https://github.com/kubernetes/enhancements/pull/2842" # Previous attempt on 0611 + - "https://github.com/kubernetes/enhancements/pull/3412" # Previous attempt on 0611 + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.31" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.31" + beta: "v1.32" + stable: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: StatefulSetUpdateVolumeClaimTemplate + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: [] From 5533b9886a184ee40b4e4a6ecc080aebd575c3e4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 17 May 2024 20:31:59 +0800 Subject: [PATCH 02/30] what is capatible --- .../README.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 6a84f33f5b7..36cf07c680b 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -289,7 +289,13 @@ When to update PVCs: ### What PVC is capatible -TODO +A PVC is capatible with the template if: +- All the immutable fields match exactly; and +- `metadata.labels` and `metadata.annotations` of PVC is a superset of the template; and +- `status.capacity.storage` of PVC is greater than or equal to + the `spec.resources.requests.storage` of the template; and +- `status.currentVolumeAttributesClassName` of PVC is equal to + the `spec.volumeAttributesClassName` of the template. ### Collected PVC Status @@ -369,6 +375,10 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> +We can use Server Side Apply to update the PVCs in-place, +so that we will not interfere with the user's manual changes, +e.g. to `metadata.labels` and `metadata.annotations`. + ### Test Plan -`volumeClaimUpdateStrategy` is introduce to keep capability of current deployed workloads. +`volumeClaimSyncStrategy` is introduce to keep capability of current deployed workloads. 
StatefulSet currently accepts and uses existing PVCs that is not created by the controller, So the `volumeClaimTemplate` and PVC can differ even before this enhancement. Some users may choose to keep the PVCs of different replicas different. We should not block the Pod updates for them. -If `volumeClaimUpdateStrategy` is `OnDeleteAsync`, -then if the template and PVC differs other than mutable fields, and it is not deleting, +If `volumeClaimSyncStrategy` is `Async`, +then if the template and PVC differs, and the PVC is not being deleted, the PVC is not considered as managed by the StatefulSet. However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. -That's why we also need `OnDeleteLockStep`. +That's why we also need `LockStep`. We consider a StatefulSet in stable state if all the managed PVCs are capatible with the current template. In a stable state, most operations are possible, and we are not actively fixing something. @@ -612,9 +618,10 @@ well as the [existing list] of feature gates. Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> -If the PVC capacity is smaller than that in the template, -the PVC will be expanded immediately after the feature is enbled. -This should be rare, the user must have created the PVC before the StatefulSet for this to happen. +If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values), +the behavior of StatefulSet controller is almost the same as before. +Except that if the PVC is deleting when performing rolling update, the controller will wait for the deletion +before creating the new Pod. This may bring additional delay if the PVC deletion is somehow blocked. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? From 3334f65b6dde831588662723b0d651beacfe4eee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 17:42:20 +0800 Subject: [PATCH 04/30] User stories --- .../README.md | 31 ++++++++++++++++--- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 5e12fbc1df1..ff795cb6563 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -85,6 +85,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Story 2: Migrating Between Storage Providers](#story-2-migrating-between-storage-providers) - [Story 3: Migrating Between Different Implementations of the Same Storage Provider](#story-3-migrating-between-different-implementations-of-the-same-storage-provider) - [Story 4: Shinking the PV by Re-creating PVC](#story-4-shinking-the-pv-by-re-creating-pvc) + - [Story 5: Asymmetric Replicas](#story-5-asymmetric-replicas) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) @@ -318,19 +319,41 @@ bogged down. #### Story 1: Batch Expand Volumes -TODO +We're running a CI/CD system and the end-to-end automation is desired. +To expand the volumes managed by a StatefulSet, +we can just use the same pipeline that we are already using to updating the Pod. +All the test, review, approval, and rollback process can be reused. 
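+
+As an illustration only (not part of the proposal), such a pipeline step might
+apply a JSON patch with client-go once this enhancement allows the
+`volumeClaimTemplate` to be updated; the namespace, name, and `200Gi` target
+size below are hypothetical:
+
+```go
+// Sketch of a pipeline step that bumps the size in the first claim template.
+// Today the API server rejects this patch; with this enhancement it succeeds
+// and the controller reconciles the existing PVCs per the update strategy.
+package main
+
+import (
+	"context"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/rest"
+)
+
+func expandTemplate(ctx context.Context, namespace, name string) error {
+	cfg, err := rest.InClusterConfig()
+	if err != nil {
+		return err
+	}
+	client, err := kubernetes.NewForConfig(cfg)
+	if err != nil {
+		return err
+	}
+	// Only the template is patched; existing PVCs are reconciled (or replaced
+	// on deletion) by the StatefulSet controller afterwards.
+	patch := []byte(`[{"op": "replace",
+		"path": "/spec/volumeClaimTemplates/0/spec/resources/requests/storage",
+		"value": "200Gi"}]`)
+	_, err = client.AppsV1().StatefulSets(namespace).Patch(
+		ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
+	return err
+}
+```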
#### Story 2: Migrating Between Storage Providers -TODO +We decide to switch from home-made local storage to the storage provided by a cloud provider. +We can not afford any downtime, so we don't want to delete and recreate the StatefulSet. +Our app can automatically rebuild the data in the new storage from other replicas. +So we update the `volumeClaimTemplate` of the StatefulSet, +delete the PVC and Pod of one replica, let the controller re-create them, +then monitor the rebuild process. +Once the rebuild completes successfully, we proceed to the next replica. #### Story 3: Migrating Between Different Implementations of the Same Storage Provider -TODO +Our storage provider has a new version that provides new features, but can not be upgraded in-place. +We can prepare some new PersistentVolumes using the new version, but referencing the same disk +from the provider as the in-use PVs. +Then the same update process as Story 2 can be used. +Although the PVCs are recreated, the data is preserved, so no rebuild is needed. #### Story 4: Shinking the PV by Re-creating PVC -TODO +After running our app for a while, we optimize the data layout and reduce the required storage size. +Now we want to shrink the PVs to save cost. +The same process as Story 2 can be used. + +#### Story 5: Asymmetric Replicas + +The replicas of our StatefulSet are not identical, so we still want to update +each PVC manually and separately. +Possibly we also update the `volumeClaimTemplate` for new replicas, +but we don't want the controller to interfere with the existing replicas. ### Notes/Constraints/Caveats (Optional) From 17f9f77723d1a121c9ac91337c0e395e1aaf92c6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 18:01:02 +0800 Subject: [PATCH 05/30] we are already waiting for PVC deletion --- .../NNNN-stateful-set-update-claim-template/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index ff795cb6563..a177a86d766 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -269,7 +269,11 @@ How to update PVCs: 2. If it is not possible to make the PVC [capatible](#what-pvc-is-capatible), do nothing. But when recreating a Pod and the corresponding PVC is deleting, wait for the deletion then create a new PVC with the current template - together with the new Pod. + together with the new Pod (already implemented). + When to update PVCs: 1. Before recreate the pod, additionally check that the PVC is @@ -641,10 +645,9 @@ well as the [existing list] of feature gates. Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> +No. If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values), the behavior of StatefulSet controller is almost the same as before. -Except that if the PVC is deleting when performing rolling update, the controller will wait for the deletion -before creating the new Pod. This may bring additional delay if the PVC deletion is somehow blocked. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? 
From 7b867242e3f0e7d8d5fd194e8812155b8ebd7783 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 19:19:07 +0800 Subject: [PATCH 06/30] status --- .../README.md | 47 +++++++++++++++---- 1 file changed, 38 insertions(+), 9 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index a177a86d766..803eb8437f1 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -78,7 +78,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Non-Goals](#non-goals) - [Proposal](#proposal) - [Updated Reconciliation Logic](#updated-reconciliation-logic) - - [What PVC is capatible](#what-pvc-is-capatible) + - [What PVC is compatible](#what-pvc-is-compatible) - [Collected PVC Status](#collected-pvc-status) - [User Stories (Optional)](#user-stories-optional) - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes) @@ -266,7 +266,7 @@ How to update PVCs: Note that decrease the size can help recover from a failed expansion if `RecoverVolumeExpansionFailure` feature gate is enabled. -2. If it is not possible to make the PVC [capatible](#what-pvc-is-capatible), +2. If it is not possible to make the PVC [compatible](#what-pvc-is-compatible), do nothing. But when recreating a Pod and the corresponding PVC is deleting, wait for the deletion then create a new PVC with the current template together with the new Pod (already implemented). @@ -277,7 +277,7 @@ Warning FailedCreate 3m58s (x7 over 3m58s) statefulset-controller cre When to update PVCs: 1. Before recreate the pod, additionally check that the PVC is - [capatible](#what-pvc-is-capatible) with the new `volumeClaimTemplate`. + [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. If not, update the PVC after old Pod deleted, before creating new pod, or if update is not possible: - If `volumeClaimSyncStrategy` is `LockStep`, @@ -287,20 +287,20 @@ When to update PVCs: 2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, The PVCs should be updated just like Pods would. A replica is considered ready - if all its volumes are capatible with the new `volumeClaimTemplate`. + if all its volumes are compatible with the new `volumeClaimTemplate`. `.spec.ordinals` and `.spec.updateStrategy.rollingUpdate.partition` are also respected. e.g.: - If `.spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. Only proceed to the next ordinal when all the PVCs of the previous ordinal - are capatible with the new `volumeClaimTemplate`. + are compatible with the new `volumeClaimTemplate`. - If `.spec.updateStrategy.type` is `OnDelete`, Only update the PVC when the Pod is deleted. -### What PVC is capatible +### What PVC is compatible -A PVC is capatible with the template if: +A PVC is compatible with the template if: - All the immutable fields match exactly; and - `metadata.labels` and `metadata.annotations` of PVC is a superset of the template; and - `status.capacity.storage` of PVC is greater than or equal to @@ -310,7 +310,23 @@ A PVC is capatible with the template if: ### Collected PVC Status -TODO +For each PVC in the template: +- compatible: the number of PVCs that are compatible with the template. + These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`. 
+- updating: the number of PVCs that are being updated in-place. +- overSized: the number of PVCs that are over-sized. +- totalCapacity: the sum of `status.capacity` of all the PVCs. + +Some fields in the `status` are also updated to reflect the staus of the PVCs: +- readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: + - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; + - `volumeClaimSyncStrategy` is `LockStep` and the PVC is not compatible with the template; +- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` +- currentRevision, updateRevision, currentReplicas, updatedReplicas + are updated to reflect the status of PVCs. + +With these changes, user can still use `kubectl rollout status` to monitor the update process, +both for in-place update and for the PVCs that need manual intervention. ### User Stories (Optional) @@ -382,9 +398,12 @@ However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. That's why we also need `LockStep`. -We consider a StatefulSet in stable state if all the managed PVCs are capatible with the current template. +We consider a StatefulSet in stable state if all the managed PVCs are compatible with the current template. In a stable state, most operations are possible, and we are not actively fixing something. +The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplate`, +so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. + ### Risks and Mitigations +When the `volumeClaimSyncStrategy` is set to `LockStep`, keeping PVCs that are +incompatible with the template is dangerous. This will block the Pod from being +recreated, and the workload will be unavailable if some Pods are evicted. +We should document this clearly and report the replica as not ready in the status +to warn the user. +this should only happen when the user manually updates the PVC, +or the `volumeClaimSyncStrategy` is updated to `LockStep` while the PVC is not compatible. + +TODO: Recover from failed in-place update (insufficient storage, etc.) +What else is needed in addition to revert the StatefulSet spec? ## Design Details From d1fb4ed34a5e9514f932a4cdfae2c283ebb1d533 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 19 May 2024 22:32:25 +0800 Subject: [PATCH 07/30] check compatible before advancing updatedReplicas --- .../README.md | 23 +++++++------------ 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 803eb8437f1..38b98a92173 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -276,25 +276,27 @@ Warning FailedCreate 3m58s (x7 over 3m58s) statefulset-controller cre --> When to update PVCs: -1. Before recreate the pod, additionally check that the PVC is +1. Before advancing `status.updatedReplicas` to the next replica, + additionally check that the PVCs of the next replica are [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. 
If not, update the PVC after old Pod deleted, before creating new pod, or if update is not possible: + - If `volumeClaimSyncStrategy` is `LockStep`, - wait for the user to delete/update the old PVC manually before delete the old pod. + wait for the user to delete/update the old PVC manually. - If `volumeClaimSyncStrategy` is `Async`, - the diff is ignored and the pod recreation proceeds. + the diff is ignored and the normal rolling update proceeds. 2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, The PVCs should be updated just like Pods would. A replica is considered ready if all its volumes are compatible with the new `volumeClaimTemplate`. - `.spec.ordinals` and `.spec.updateStrategy.rollingUpdate.partition` are also respected. + `spec.ordinals` and `spec.updateStrategy.rollingUpdate.partition` are also respected. e.g.: - - If `.spec.updateStrategy.type` is `RollingUpdate`, + - If `spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. Only proceed to the next ordinal when all the PVCs of the previous ordinal are compatible with the new `volumeClaimTemplate`. - - If `.spec.updateStrategy.type` is `OnDelete`, + - If `spec.updateStrategy.type` is `OnDelete`, Only update the PVC when the Pod is deleted. @@ -320,7 +322,6 @@ For each PVC in the template: Some fields in the `status` are also updated to reflect the staus of the PVCs: - readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; - - `volumeClaimSyncStrategy` is `LockStep` and the PVC is not compatible with the template; - availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` - currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs. @@ -417,14 +418,6 @@ How will UX be reviewed, and by whom? Consider including folks who also work outside the SIG or subproject. --> -When the `volumeClaimSyncStrategy` is set to `LockStep`, keeping PVCs that are -incompatible with the template is dangerous. This will block the Pod from being -recreated, and the workload will be unavailable if some Pods are evicted. -We should document this clearly and report the replica as not ready in the status -to warn the user. -this should only happen when the user manually updates the PVC, -or the `volumeClaimSyncStrategy` is updated to `LockStep` while the PVC is not compatible. - TODO: Recover from failed in-place update (insufficient storage, etc.) What else is needed in addition to revert the StatefulSet spec? From b73739d1e6b2c10d179bf5e45cc90c3ac61df4ef Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Wed, 22 May 2024 20:10:24 +0800 Subject: [PATCH 08/30] add Kubernetes API Changes section --- .../README.md | 52 +++++++++++-------- 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md index 38b98a92173..81d53fc3adc 100644 --- a/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/NNNN-stateful-set-update-claim-template/README.md @@ -77,9 +77,9 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
+  - [Kubernetes API Changes](#kubernetes-api-changes)
  - [Updated Reconciliation Logic](#updated-reconciliation-logic)
  - [What PVC is compatible](#what-pvc-is-compatible)
-  - [Collected PVC Status](#collected-pvc-status)
  - [User Stories (Optional)](#user-stories-optional)
@@ -242,17 +242,42 @@ nitty-gritty.
 
 2. Modify StatefulSet controller to add PVC reconciliation logic.
 
-3. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
+3. Collect the status of managed PVCs, and show them in the StatefulSet status.
+
+### Kubernetes API Changes
+
+Changes to StatefulSet `spec`:
+
+1. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
   specify how to coordinate the update of PVCs and Pods. Possible values are:
   - `OnDelete`: the default value, only update the PVC when the old PVC is deleted.
   - `InPlace`: update the PVC in-place if possible. Also includes the `OnDelete` behavior.

-4. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`
+2. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`
   to specify how to update PVCs and Pods. Possible values are:
   - `Async`: the default value, preserve the current behavior.
   - `LockStep`: update PVCs first, then update Pods. See below for details.

-5. Collect the status of managed PVCs, and show them in the StatefulSet status.
+Changes to StatefulSet `status`:
+
+Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
+
+For each PVC in the template:
+- compatible: the number of PVCs that are compatible with the template.
+  These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`.
+- updating: the number of PVCs that are being updated in-place.
+- overSized: the number of PVCs that are over-sized.
+- totalCapacity: the sum of `status.capacity` of all the PVCs.
+
+Some fields in the `status` are also updated to reflect the status of the PVCs:
+- readyReplicas: in addition to pods, also consider the PVC status. A PVC is not ready if:
+  - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating;
+- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds`.
+- currentRevision, updateRevision, currentReplicas, updatedReplicas
+  are updated to reflect the status of PVCs.
+
+With these changes, users can still use `kubectl rollout status` to monitor the update process,
+both for in-place update and for the PVCs that need manual intervention.

### Updated Reconciliation Logic

-### Collected PVC Status
-
-For each PVC in the template:
-- compatible: the number of PVCs that are compatible with the template.
-  These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`.
-- updating: the number of PVCs that are being updated in-place.
-- overSized: the number of PVCs that are over-sized.
-- totalCapacity: the sum of `status.capacity` of all the PVCs.
- -Some fields in the `status` are also updated to reflect the staus of the PVCs: -- readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: - - `volumeClaimUpdateStrategy` is `InPlace` and the PVC is updating; -- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` -- currentRevision, updateRevision, currentReplicas, updatedReplicas - are updated to reflect the status of PVCs. - -With these changes, user can still use `kubectl rollout status` to monitor the update process, -both for in-place update and for the PVCs that need manual intervention. - ### User Stories (Optional) -# KEP-NNNN: StatefulSet Support for Updating Volume Claim Template +# KEP-4650: StatefulSet Support for Updating Volume Claim Template +3. Use either current or updated revision of the `volumeClaimTemplate` to create/update the PVC, + just like Pod template. + When to update PVCs: -1. Before advancing `status.updatedReplicas` to the next replica, +1. If `volumeClaimSyncStrategy` is `LockStep`, + before advancing `status.updatedReplicas` to the next replica, additionally check that the PVCs of the next replica are [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. - If not, update the PVC after old Pod deleted, before creating new pod, - or if update is not possible: - - - If `volumeClaimSyncStrategy` is `LockStep`, - wait for the user to delete/update the old PVC manually. - - If `volumeClaimSyncStrategy` is `Async`, - the diff is ignored and the normal rolling update proceeds. - -2. If Pod spec does not change, only mutable fields in `volumeClaimTemplate` differ, - The PVCs should be updated just like Pods would. A replica is considered ready - if all its volumes are compatible with the new `volumeClaimTemplate`. - `spec.ordinals` and `spec.updateStrategy.rollingUpdate.partition` are also respected. + If not, and we are not going to update it in-place automatically, + wait for the user to delete/update the old PVC manually. + +2. When doing rolling update, A replica is considered ready if the Pod is ready + and all its volumes are not being updated in-place. + Wait for a replica to be ready for at least `minReadySeconds` before proceeding to the next replica. + +3. Whenever we check for Pod update, also check for PVCs update. e.g.: - If `spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. - Only proceed to the next ordinal when all the PVCs of the previous ordinal - are compatible with the new `volumeClaimTemplate`. - If `spec.updateStrategy.type` is `OnDelete`, Only update the PVC when the Pod is deleted. + +4. When updating the PVC in-place, if we also re-create the Pod, + update the PVC after old Pod deleted, together with creating new pod. + Otherwise, if pod is not changed, update the PVC only. + +Failure cases: don't left too many PVCs being updated in-place. We expect to update the PVCs in order. + +- If the PVC update fails, we should block the update process. + If the Pod is also deleted (by controller or manually), don't block the creation of new Pod. + We should retry and report events for this. + The events and status should look like those when the Pod creation fails. + +- While waiting for the PVC to reach the compatible state, + We should update status, just like what we do when waiting for Pod to be ready. + We should block the update process if the PVC is never compatible. 
+ +- If the `volumeClaimTemplate` is updated again when the previous rollout is blocked, + similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), + user may need to manually deal with the blocking PVCs (update or delete them). ### What PVC is compatible @@ -348,7 +364,7 @@ bogged down. We're running a CI/CD system and the end-to-end automation is desired. To expand the volumes managed by a StatefulSet, -we can just use the same pipeline that we are already using to updating the Pod. +we can just use the same pipeline that we are already using to update the Pod. All the test, review, approval, and rollback process can be reused. #### Story 2: Migrating Between Storage Providers @@ -377,8 +393,8 @@ The same process as Story 2 can be used. #### Story 5: Asymmetric Replicas -The replicas of our StatefulSet are not identical, so we still want to update -each PVC manually and separately. +The storage requirement of different replicas are not identical, +so we still want to update each PVC manually and separately. Possibly we also update the `volumeClaimTemplate` for new replicas, but we don't want the controller to interfere with the existing replicas. @@ -391,6 +407,10 @@ Go in to as much detail as necessary here. This might be a good place to talk about core concepts and how they relate. --> +When designing the `InPlace` update strategy, we update the PVC like how we re-create the Pod. +i.e. we update the PVC whenever we would re-create the Pod; +we wait for the PVC to be compatible whenever we would wait for the Pod to be ready. + `volumeClaimSyncStrategy` is introduce to keep capability of current deployed workloads. StatefulSet currently accepts and uses existing PVCs that is not created by the controller, So the `volumeClaimTemplate` and PVC can differ even before this enhancement. @@ -398,16 +418,14 @@ Some users may choose to keep the PVCs of different replicas different. We should not block the Pod updates for them. If `volumeClaimSyncStrategy` is `Async`, -then if the template and PVC differs, and the PVC is not being deleted, -the PVC is not considered as managed by the StatefulSet. +we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplate`, +as what we do currently. +Of course, we report this in the status of the StatefulSet. However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. That's why we also need `LockStep`. -We consider a StatefulSet in stable state if all the managed PVCs are compatible with the current template. -In a stable state, most operations are possible, and we are not actively fixing something. - The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplate`, so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. @@ -994,7 +1012,8 @@ information to express the idea and why it was not acceptable. e.g., prevent decreasing the storage size, preventing expand if the storage class does not support it. However, this have saveral drawbacks: * Not reverting the `volumeClaimTemplate` when rollback the StatefulSet is confusing, -* This can be a barrier when recovering from a failed update. +* The validation can be a barrier when recovering from a failed update. + If RecoverVolumeExpansionFailure feature gate is enabled, we can recover from failed expansion by decreasing the size. 
* The validation is racy, especially when recovering from failed expansion. We still need to consider most abnormal cases even we do those validations. * This does not match the pattern of existing behaviors. From 175b7d0a15bd9c8116f0e93abbf1a1c56c36a3ac Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Thu, 23 May 2024 11:53:49 +0800 Subject: [PATCH 11/30] volumeClaimTemplates --- .../README.md | 47 ++++++++++--------- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/keps/sig-storage/4650-stateful-set-update-claim-template/README.md b/keps/sig-storage/4650-stateful-set-update-claim-template/README.md index 2fa59d523dc..ba924b25cac 100644 --- a/keps/sig-storage/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-storage/4650-stateful-set-update-claim-template/README.md @@ -107,8 +107,8 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - - [extensively validate the updated volumeClaimTemplate](#extensively-validate-the-updated-volumeclaimtemplate) - - [Only support for updating volumeClaimTemplate.spec.resources.requests.storage](#only-support-for-updating-volumeclaimtemplatespecresourcesrequestsstorage) + - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates) + - [Only support for updating storage size](#only-support-for-updating-storage-size) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -175,8 +175,8 @@ updates. [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md --> -Kubernetes does not support the modification of the `volumeClaimTemplate` of a StatefulSet currently. -This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplate`, +Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently. +This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplates`, automatically updating the associated PersistentVolumeClaim objects in-place if applicable. Currently, PVC `spec.resources.requests.storage` and `spec.volumeAttributesClassName` fields can be updated in-place. @@ -211,7 +211,7 @@ This brings many headaches in a continuously evolving environment. List the specific goals of the KEP. What is it trying to achieve? How will we know that this has succeeded? --> -* Allow users to update the `volumeClaimTemplate` of a `StatefulSet` in place. +* Allow users to update the `volumeClaimTemplates` of a `StatefulSet` in place. * Automatically update the associated PersistentVolumeClaim objects in-place if applicable. * Support updating PersistentVolumeClaim objects with `OnDelete` strategy. * Coordinate updates to `Pod` and PersistentVolumeClaim objects. @@ -224,7 +224,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion and make progress. --> * Support automatic rolling update of PersistentVolumeClaim. -* Validate the updated `volumeClaimTemplate` as how PVC update does. +* Validate the updated `volumeClaimTemplates` as how PVC update does. * Update ephemeral volumes. @@ -238,7 +238,7 @@ implementation. What is the desired outcome and how do we measure success?. The "Design Details" section below is for the real nitty-gritty. --> -1. Change API server to allow any updates to `volumeClaimTemplate` of a StatefulSet. +1. 
Change API server to allow any updates to `volumeClaimTemplates` of a StatefulSet. 2. Modify StatefulSet controller to add PVC reconciliation logic. @@ -283,7 +283,7 @@ both for in-place update and for the PVCs that need manual intervention. How to update PVCs: 1. If `volumeClaimUpdateStrategy` is `InPlace`, - and if `volumeClaimTemplate` and actual PVC only differ in mutable fields + and if `volumeClaimTemplates` and actual PVC only differ in mutable fields (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently), update the PVC in-place to the extent possible. Do not perform the update that will be rejected by API server, such as @@ -299,14 +299,14 @@ Tested on Kubernetes v1.28, and I can see this event: Warning FailedCreate 3m58s (x7 over 3m58s) statefulset-controller create Pod test-rwop-0 in StatefulSet test-rwop failed error: pvc data-test-rwop-0 is being deleted --> -3. Use either current or updated revision of the `volumeClaimTemplate` to create/update the PVC, +3. Use either current or updated revision of the `volumeClaimTemplates` to create/update the PVC, just like Pod template. When to update PVCs: 1. If `volumeClaimSyncStrategy` is `LockStep`, before advancing `status.updatedReplicas` to the next replica, additionally check that the PVCs of the next replica are - [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplate`. + [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`. If not, and we are not going to update it in-place automatically, wait for the user to delete/update the old PVC manually. @@ -336,7 +336,7 @@ Failure cases: don't left too many PVCs being updated in-place. We expect to upd We should update status, just like what we do when waiting for Pod to be ready. We should block the update process if the PVC is never compatible. -- If the `volumeClaimTemplate` is updated again when the previous rollout is blocked, +- If the `volumeClaimTemplates` is updated again when the previous rollout is blocked, similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), user may need to manually deal with the blocking PVCs (update or delete them). @@ -372,7 +372,7 @@ All the test, review, approval, and rollback process can be reused. We decide to switch from home-made local storage to the storage provided by a cloud provider. We can not afford any downtime, so we don't want to delete and recreate the StatefulSet. Our app can automatically rebuild the data in the new storage from other replicas. -So we update the `volumeClaimTemplate` of the StatefulSet, +So we update the `volumeClaimTemplates` of the StatefulSet, delete the PVC and Pod of one replica, let the controller re-create them, then monitor the rebuild process. Once the rebuild completes successfully, we proceed to the next replica. @@ -395,7 +395,7 @@ The same process as Story 2 can be used. The storage requirement of different replicas are not identical, so we still want to update each PVC manually and separately. -Possibly we also update the `volumeClaimTemplate` for new replicas, +Possibly we also update the `volumeClaimTemplates` for new replicas, but we don't want the controller to interfere with the existing replicas. ### Notes/Constraints/Caveats (Optional) @@ -413,12 +413,12 @@ we wait for the PVC to be compatible whenever we would wait for the Pod to be re `volumeClaimSyncStrategy` is introduce to keep capability of current deployed workloads. 
StatefulSet currently accepts and uses existing PVCs that is not created by the controller, -So the `volumeClaimTemplate` and PVC can differ even before this enhancement. +So the `volumeClaimTemplates` and PVC can differ even before this enhancement. Some users may choose to keep the PVCs of different replicas different. We should not block the Pod updates for them. If `volumeClaimSyncStrategy` is `Async`, -we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplate`, +we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplates`, as what we do currently. Of course, we report this in the status of the StatefulSet. @@ -426,7 +426,7 @@ However, a workload may rely on some features provided by a specific PVC, So we should provide a way to coordinate the update. That's why we also need `LockStep`. -The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplate`, +The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplates`, so that a `LockStep` StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. ### Risks and Mitigations @@ -1006,12 +1006,12 @@ What other approaches did you consider, and why did you rule them out? These do not need to be as detailed as the proposal, but should include enough information to express the idea and why it was not acceptable. --> -### extensively validate the updated `volumeClaimTemplate` +### Extensively validate the updated `volumeClaimTemplates` -[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplate`. +[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplates`. e.g., prevent decreasing the storage size, preventing expand if the storage class does not support it. However, this have saveral drawbacks: -* Not reverting the `volumeClaimTemplate` when rollback the StatefulSet is confusing, +* Not reverting the `volumeClaimTemplates` when rollback the StatefulSet is confusing, * The validation can be a barrier when recovering from a failed update. If RecoverVolumeExpansionFailure feature gate is enabled, we can recover from failed expansion by decreasing the size. * The validation is racy, especially when recovering from failed expansion. @@ -1019,16 +1019,17 @@ However, this have saveral drawbacks: * This does not match the pattern of existing behaviors. That is, the controller should take the expected state, retry as needed to reach that state. For example, StatefulSet will not reject a invalid `serviceAccountName`. -* `volumeClaimTemplate` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, +* `volumeClaimTemplates` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, a user may still want to affect new PVCs. -### Only support for updating `volumeClaimTemplate.spec.resources.requests.storage` +### Only support for updating storage size -[KEP-0661] only enables expanding the volume. However, because the StatefulSet can take pre-existing PVCs, +[KEP-0661] only enables expanding the volume by updating `volumeClaimTemplates[*].spec.resources.requests.storage`. +However, because the StatefulSet can take pre-existing PVCs, we still need to consider what to do when template and PVC don't match. The complexity of this proposal will not decrease much if we only support expanding the volume. 
-By enabling arbitrary updating to the `volumeClaimTemplate`, +By enabling arbitrary updating to the `volumeClaimTemplates`, we just acknowledge and officially support this use case. [KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412 From e88af87d2ec2e0d5ebf10826dbc96cfcbe675a6a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sat, 25 May 2024 23:40:32 +0800 Subject: [PATCH 12/30] Change the owning-sig to sig-apps --- .../4650-stateful-set-update-claim-template/README.md | 0 .../4650-stateful-set-update-claim-template/kep.yaml | 4 ++-- 2 files changed, 2 insertions(+), 2 deletions(-) rename keps/{sig-storage => sig-apps}/4650-stateful-set-update-claim-template/README.md (100%) rename keps/{sig-storage => sig-apps}/4650-stateful-set-update-claim-template/kep.yaml (97%) diff --git a/keps/sig-storage/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md similarity index 100% rename from keps/sig-storage/4650-stateful-set-update-claim-template/README.md rename to keps/sig-apps/4650-stateful-set-update-claim-template/README.md diff --git a/keps/sig-storage/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml similarity index 97% rename from keps/sig-storage/4650-stateful-set-update-claim-template/kep.yaml rename to keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml index 97160bbc9d2..3a9e5ebf8db 100644 --- a/keps/sig-storage/4650-stateful-set-update-claim-template/kep.yaml +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -2,9 +2,9 @@ title: StatefulSet Support for Updating Volume Claim Template kep-number: 4650 authors: - "@huww98" -owning-sig: sig-storage +owning-sig: sig-apps participating-sigs: - - sig-app + - sig-storage status: provisional creation-date: 2024-05-17 reviewers: From cfa7473716133bf9846d6dbbdbb054ca8ae37f6b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Mon, 17 Jun 2024 19:44:33 +0800 Subject: [PATCH 13/30] update for comments Production Readiness review, etc. --- keps/prod-readiness/sig-apps/4650.yaml | 3 ++ .../README.md | 43 +++++++++++++++---- .../kep.yaml | 6 +-- 3 files changed, 40 insertions(+), 12 deletions(-) create mode 100644 keps/prod-readiness/sig-apps/4650.yaml diff --git a/keps/prod-readiness/sig-apps/4650.yaml b/keps/prod-readiness/sig-apps/4650.yaml new file mode 100644 index 00000000000..31adc0d5d14 --- /dev/null +++ b/keps/prod-readiness/sig-apps/4650.yaml @@ -0,0 +1,3 @@ +kep-number: 4650 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index ba924b25cac..68c67c31edf 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -201,8 +201,8 @@ They can only expand the volumes, or modify them with VolumeAttributesClass by updating individual PersistentVolumeClaim objects as an ad-hoc operation. When the StatefulSet scales up, the new PVC(s) will be created with the old config and this again needs manual intervention. -Modifying immutable parameters, shinking, or even switch to another -storage provider is not possible currently. +Modifying immutable parameters, shrinking, or even switching to another +storage provider is not currently possible. This brings many headaches in a continuously evolving environment. 
### Goals
@@ -678,12 +678,6 @@ well as the [existing list] of feature gates.
- Components depending on the feature gate:
- kube-apiserver
- kube-controller-manager
-- [ ] Other
- - Describe the mechanism:
- - Will enabling / disabling the feature require downtime of the control
- plane?
- - Will enabling / disabling the feature require downtime or reprovisioning
- of a node?

###### Does enabling the feature change any default behavior?

<!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
-No.
+The update to StatefulSet `volumeClaimTemplates` will now be accepted by the API server, whereas it was previously rejected.
+
+Otherwise, no.
If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values),
the behavior of the StatefulSet controller is almost the same as before.

@@ -707,9 +703,17 @@ feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
+Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs now,
+disabling this feature gate should not leave any inconsistent state.
+
+If the `volumeClaimTemplates` is updated, then the feature is disabled and the StatefulSet is rolled back,
+the `volumeClaimTemplates` will be kept at the latest version, and their history will be lost.

###### What happens if we reenable the feature if it was previously rolled back?

+If `volumeClaimUpdateStrategy` is already set to `InPlace`, reenabling the feature
+will kick off the update process immediately.
+
###### Are there any tests for feature enablement/disablement?

+We will add unit tests for the StatefulSet controller with and without the feature gate,
+`volumeClaimUpdateStrategy` set to `InPlace` and `OnDelete` respectively.

### Rollout, Upgrade and Rollback Planning

@@ -886,6 +892,16 @@ Focusing mostly on:
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
+- PATCH StatefulSet
+ - kubectl or other user agents
+- PATCH PersistentVolumeClaim
+ - 1 per updated PVC in the StatefulSet (number of updated claim templates * replicas)
+ - StatefulSet controller (in KCM)
+ - triggered by the StatefulSet spec update
+- PATCH StatefulSet status
+ - 1-2 per updated PVC in the StatefulSet (number of updated claim templates * replicas)
+ - StatefulSet controller (in KCM)
+ - triggered by the StatefulSet spec update and PVC status update

###### Will enabling / using this feature result in introducing new API types?

<!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
+No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

<!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
+Not directly. The cloud provider may be called when the PVCs are updated.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
+StatefulSet:
+- `spec`: 2 new enum fields, ~10B
+- `status`: 4 new integer fields, ~10B

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
@@ -923,6 +944,7 @@ Think about adding additional work or introducing new steps in between [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos --> +No. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? @@ -935,6 +957,8 @@ This through this both in small and large cases, again with respect to the [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md --> +The logic of StatefulSet controller is more complex, more CPU will be used. +TODO: measure the actual increase. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? @@ -947,6 +971,7 @@ If any of the resources can be exhausted, how this is mitigated with the existin Are there any tests that were run/should be run to understand performance characteristics better and validate the declared limits? --> +No. ### Troubleshooting diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml index 3a9e5ebf8db..89587d8f26f 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -2,6 +2,7 @@ title: StatefulSet Support for Updating Volume Claim Template kep-number: 4650 authors: - "@huww98" + - "@vie-serendipity" owning-sig: sig-apps participating-sigs: - sig-storage @@ -12,6 +13,7 @@ reviewers: - "@gnufied" - "@msau42" - "@xing-yang" + - "@soltysh" approvers: - "@kow3ns" - "@xing-yang" @@ -33,9 +35,7 @@ latest-milestone: "v1.31" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.31" - beta: "v1.32" - stable: "v1.33" + alpha: "v1.32" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From 82d0a102e6d14de6fc338e96097f7cbf2d93d87b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Tue, 18 Jun 2024 00:19:57 +0800 Subject: [PATCH 14/30] some clarifications --- .../README.md | 55 ++++++++++++------- 1 file changed, 35 insertions(+), 20 deletions(-) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index 68c67c31edf..24f3cd3e236 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -177,9 +177,9 @@ updates. Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently. This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplates`, -automatically updating the associated PersistentVolumeClaim objects in-place if applicable. -Currently, PVC `spec.resources.requests.storage` and `spec.volumeAttributesClassName` -fields can be updated in-place. +automatically patching the associated PersistentVolumeClaim objects if applicable. +Currently, PVC `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` +can be patched. For other fields, we support updating existing PersistentVolumeClaim objects with `OnDelete` strategy. All the updates to PersistentVolumeClaim can be coordinated with `Pod` updates to honor any dependencies between them. 
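To make the API shape summarized above concrete, here is a minimal Go sketch of the two proposed fields as they stand at this point in the proposal. The type and constant names are illustrative only and follow this revision of the KEP; nothing here is part of any released Kubernetes API.

```go
// Sketch only: proposed additions to the apps/v1 StatefulSet types.

// VolumeClaimUpdateStrategy controls how the controller applies PVC updates.
type VolumeClaimUpdateStrategy string

const (
	// OnDeleteVolumeClaimUpdateStrategy (default): a claim is only replaced
	// after the old PVC has been deleted.
	OnDeleteVolumeClaimUpdateStrategy VolumeClaimUpdateStrategy = "OnDelete"
	// InPlaceVolumeClaimUpdateStrategy: mutable PVC fields are additionally
	// patched in place; includes the OnDelete behavior.
	InPlaceVolumeClaimUpdateStrategy VolumeClaimUpdateStrategy = "InPlace"
)

type StatefulSetSpec struct {
	// ... existing fields elided ...

	// VolumeClaimUpdateStrategy defaults to OnDelete.
	VolumeClaimUpdateStrategy VolumeClaimUpdateStrategy `json:"volumeClaimUpdateStrategy,omitempty"`
}

type RollingUpdateStatefulSetStrategy struct {
	// ... existing fields elided ...

	// VolumeClaimSyncStrategy is "Async" (default) or "LockStep".
	VolumeClaimSyncStrategy string `json:"volumeClaimSyncStrategy,omitempty"`
}
```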
@@ -211,8 +211,8 @@ know that this has succeeded?
-* Allow users to update the `volumeClaimTemplate` of a `StatefulSet` in place.
-* Automatically update the associated PersistentVolumeClaim objects in-place if applicable.
+* Allow users to update the `volumeClaimTemplates` of a `StatefulSet`.
+* Automatically patch the associated PersistentVolumeClaim objects if applicable, without interrupting the running Pods.
* Support updating PersistentVolumeClaim objects with `OnDelete` strategy.
* Coordinate updates to `Pod` and PersistentVolumeClaim objects.
* Provide accurate status and error messages to users when the update fails.
@@ -223,8 +223,8 @@
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
-* Support automatic rolling update of PersistentVolumeClaim.
-* Validate the updated `volumeClaimTemplates` as how PVC update does.
+* Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.
+* Validate the updated `volumeClaimTemplates` with the same rules used for a PVC patch.
* Update ephemeral volumes.

@@ -251,7 +251,7 @@ Changes to StatefulSet `spec`:
1. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
   specify how to coordinate the update of PVCs and Pods. Possible values are:
   - `OnDelete`: the default value, only update the PVC when the old PVC is deleted.
-   - `InPlace`: update the PVC in-place if possible. Also includes the `OnDelete` behavior.
+   - `InPlace`: patch the PVC in-place if possible. Also includes the `OnDelete` behavior.
2. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`
   to specify how to update PVCs and Pods. Possible values are:
@@ -265,7 +265,7 @@ Additionally collect the status of managed PVCs, and show them in the StatefulSe
For each PVC in the template:
- compatible: the number of PVCs that are compatible with the template.
  These replicas will not be blocked on Pod recreation if `volumeClaimSyncStrategy` is `LockStep`.
-- updating: the number of PVCs that are being updated in-place.
+- updating: the number of PVCs that are being updated in-place (e.g. expansion in progress).
- overSized: the number of PVCs that are over-sized.
- totalCapacity: the sum of `status.capacity` of all the PVCs.
@@ -277,7 +277,7 @@ Some fields in the `status` are also updated to reflect the status of the PVCs:
are updated to reflect the status of PVCs.

With these changes, users can still use `kubectl rollout status` to monitor the update process,
-both for in-place update and for the PVCs that need manual intervention.
+both for automated patching and for the PVCs that need manual intervention.

### Updated Reconciliation Logic
@@ -285,11 +285,13 @@ How to update PVCs:
1. If `volumeClaimUpdateStrategy` is `InPlace`,
   and if `volumeClaimTemplates` and actual PVC only differ in mutable fields
   (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently),
-   update the PVC in-place to the extent possible.
-   Do not perform the update that will be rejected by API server, such as
-   decreasing the storage size below its current status.
-   Note that decrease the size can help recover from a failed expansion if
-   `RecoverVolumeExpansionFailure` feature gate is enabled.
+   patch the PVC to the extent possible.
+ - `spec.resources.requests.storage` is patched to max(template spec, PVC status)
+ - Do not decrease the storage size below its current status.
+ Note that decreasing the size in the PVC spec can help recover from a failed expansion if
+ `RecoverVolumeExpansionFailure` feature gate is enabled.
+ - `spec.volumeAttributesClassName` is patched to the template value.
+ - `metadata.labels` and `metadata.annotations` are patched with server side apply.

2. If it is not possible to make the PVC [compatible](#what-pvc-is-compatible),
   do nothing. But when recreating a Pod and the corresponding PVC is deleting,
@@ -307,7 +309,7 @@ When to update PVCs:
   before advancing `status.updatedReplicas` to the next replica,
   additionally check that the PVCs of the next replica are
   [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`.
-   If not, and we are not going to update it in-place automatically,
+   If not, and if we are not going to patch it automatically,
   wait for the user to delete/update the old PVC manually.
2. When doing a rolling update, a replica is considered ready if the Pod is ready
@@ -321,7 +323,7 @@ When to update PVCs:
  - If `spec.updateStrategy.type` is `OnDelete`,
    Only update the PVC when the Pod is deleted.
-4. When updating the PVC in-place, if we also re-create the Pod,
+4. When patching the PVC, if we also re-create the Pod,
   update the PVC after the old Pod is deleted, together with creating the new Pod.
   Otherwise, if the Pod is not changed, update the PVC only.
@@ -454,7 +456,7 @@ required) or even code snippets.
-We can use Server Side Apply to update the PVCs in-place,
+We can use Server Side Apply to patch the PVCs,
so that we will not interfere with the user's manual changes,
e.g. to `metadata.labels` and `metadata.annotations`.
@@ -1050,12 +1052,25 @@ However, this has several drawbacks:
### Only support for updating storage size

[KEP-0661] only enables expanding the volume by updating `volumeClaimTemplates[*].spec.resources.requests.storage`.
-However, because the StatefulSet can take pre-existing PVCs,
+However,
+1. because the StatefulSet can take pre-existing PVCs,
we still need to consider what to do when the template and PVC don't match.
The complexity of this proposal will not decrease much if we only support expanding the volume.
-By enabling arbitrary updating to the `volumeClaimTemplates`,
-we just acknowledge and officially support this use case.
+1. We now have VolumeAttributesClass (VAC), which is expected to go to beta soon,
+and which can be patched on an existing PVC. We should also support patching the VAC
+by updating `volumeClaimTemplates`.
+
+### Patch PVCs regardless of the immutable fields
+
+We propose to patch the PVCs only when the immutable fields match.
+
+If only expansion is supported, patching regardless of the immutable fields can be a logical choice.
+But this KEP also integrates with VAC, and VAC is closely coupled with the storage class.
+Only patching the VAC when the storage class matches is the more logical choice,
+and we had better follow the same operation model for all mutable fields.
+
[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412

From 7b991cb6ef2224ef3bc4ec695b6e82217238292f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Fri, 12 Jul 2024 10:55:29 +0800
Subject: [PATCH 15/30] Remove volumeClaimSyncStrategy. Don't allow editing
 immutable PVC fields.
--- .../README.md | 137 ++++++++----------
 1 file changed, 58 insertions(+), 79 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 24f3cd3e236..7d69552f181 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -82,10 +82,8 @@ tags, and then generate with `hack/update-toc.sh`.
  - [What PVC is compatible](#what-pvc-is-compatible)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes)
    - [Story 2: Shrinking the PV by Re-creating PVC](#story-2-shrinking-the-pv-by-re-creating-pvc)
    - [Story 3: Asymmetric Replicas](#story-3-asymmetric-replicas)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
@@ -176,11 +176,10 @@ updates.
-->
Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently.
-This enhancement proposes to support arbitrary modifications to the `volumeClaimTemplates`,
+This enhancement proposes to support modifications to the `volumeClaimTemplates`,
automatically patching the associated PersistentVolumeClaim objects if applicable.
Currently, PVC `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations`
can be patched.
-For other fields, we support updating existing PersistentVolumeClaim objects with `OnDelete` strategy.
All the updates to PersistentVolumeClaim can be coordinated with `Pod` updates
to honor any dependencies between them.
@@ -201,8 +200,6 @@
config and this again needs manual intervention.
-Modifying immutable parameters, shrinking, or even switching to another
-storage provider is not currently possible.
This brings many headaches in a continuously evolving environment.

### Goals

<!--
List the specific goals of the KEP. What is it trying to achieve?
How will we know that this has succeeded?
-->
-* Allow users to update the `volumeClaimTemplates` of a `StatefulSet`.
-* Automatically patch the associated PersistentVolumeClaim objects if applicable, without interrupting the running Pods.
+* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`.
+* Automatically patch the associated PersistentVolumeClaim objects, without interrupting the running Pods.
* Support updating PersistentVolumeClaim objects with `OnDelete` strategy.
* Coordinate updates to `Pod` and PersistentVolumeClaim objects.
* Provide accurate status and error messages to users when the update fails.
@@ -223,8 +223,8 @@
and make progress.
-->
-* Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.
-* Validate the updated `volumeClaimTemplates` as how PVC patch does.
+* Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.
+* Validate the updated `volumeClaimTemplates` with the same rules used for a PVC patch.
* Update ephemeral volumes.
+* Patch PVCs that are different from the template, e.g. when the StatefulSet adopts pre-existing PVCs.

## Proposal

-1. Change API server to allow any updates to `volumeClaimTemplates` of a StatefulSet.
+1. Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
+ * `labels`
+ * `annotations`
+ * `resources.requests.storage`
+ * `volumeAttributesClassName`

2. Modify StatefulSet controller to add PVC reconciliation logic.

### Kubernetes API Changes

Changes to StatefulSet `spec`:

-1. Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
- specify how to coordinate the update of PVCs and Pods. Possible values are:
- - `OnDelete`: the default value, only update the PVC when the old PVC is deleted.
- - `InPlace`: patch the PVC in-place if possible. Also includes the `OnDelete` behavior.
-2. Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`
- to specify how to update PVCs and Pods. Possible values are:
- - `Async`: the default value, preseve the current behavior.
- - `LockStep`: update PVCs first, then update Pods. See below for details.
+Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to
+specify how to coordinate the update of PVCs and Pods. Possible values are:
+- `OnDelete`: the default value, only update the PVC when the old PVC is deleted.
+- `InPlace`: patch the PVC in-place if possible. Also includes the `OnDelete` behavior.

Changes to StatefulSet `status`:

Additionally collect the status of managed PVCs, and show them in the StatefulSe

For each PVC in the template:
- compatible: the number of PVCs that are compatible with the template.
- These replicas will not be blocked on Pod recreation.
- updating: the number of PVCs that are being updated in-place (e.g. expansion in progress).
- overSized: the number of PVCs that are larger than the template.
- totalCapacity: the sum of `status.capacity` of all the PVCs.

Some fields in the `status` are also updated to reflect the status of the PVCs:
- readyReplicas: in addition to pods, also consider the PVCs' status.
- availableReplicas: total number of replicas of which both Pod and PVCs are ready
 for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas
 are updated to reflect the status of PVCs.

With these changes, users can still use `kubectl rollout status` to monitor the update process,
both for automated patching and for the PVCs that need manual intervention.

When to update PVCs:
-1. 
If `volumeClaimSyncStrategy` is `LockStep`,
-   before advancing `status.updatedReplicas` to the next replica,
-   additionally check that the PVCs of the next replica are
+1. before advancing `status.updatedReplicas` to the next replica,
+   check that the PVCs of the next replica are
   [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`.
   If not, and if we are not going to patch it automatically,
   wait for the user to delete/update the old PVC manually.
@@ To expand the volumes managed by a StatefulSet,
we can just use the same pipeline that we are already using to update the Pod.
All the test, review, approval, and rollback process can be reused.

-#### Story 2: Migrating Between Storage Providers
+#### Story 2: Shrinking the PV by Re-creating PVC

-We decide to switch from home-made local storage to the storage provided by a cloud provider.
+After running our app for a while, we optimize the data layout and reduce the required storage size.
+Now we want to shrink the PVs to save cost.
We cannot afford any downtime, so we don't want to delete and recreate the StatefulSet.
+We also don't have the infrastructure to migrate between two StatefulSets.
Our app can automatically rebuild the data in the new storage from other replicas.
So we update the `volumeClaimTemplates` of the StatefulSet,
delete the PVC and Pod of one replica, let the controller re-create them,
then monitor the rebuild process.
Once the rebuild completes successfully, we proceed to the next replica.

-#### Story 3: Migrating Between Different Implementations of the Same Storage Provider
-
-Our storage provider has a new version that provides new features, but can not be upgraded in-place.
-We can prepare some new PersistentVolumes using the new version, but referencing the same disk
-from the provider as the in-use PVs.
-Then the same update process as Story 2 can be used.
-Although the PVCs are recreated, the data is preserved, so no rebuild is needed.
-
-#### Story 4: Shinking the PV by Re-creating PVC
-
-After running our app for a while, we optimize the data layout and reduce the required storage size.
-Now we want to shrink the PVs to save cost.
-The same process as Story 2 can be used.
-
-#### Story 5: Asymmetric Replicas
+#### Story 3: Asymmetric Replicas

The storage requirements of different replicas are not identical,
so we still want to update each PVC manually and separately.

@@ When designing the `InPlace` update strategy, we update the PVC like how we re-create the Pod.
i.e. we update the PVC whenever we would re-create the Pod;
we wait for the PVC to be compatible whenever we would wait for the Pod to be ready.

-`volumeClaimSyncStrategy` is introduced to keep compatibility for currently deployed workloads.
-StatefulSet currently accepts and uses existing PVCs that are not created by the controller,
-So the `volumeClaimTemplates` and PVC can differ even before this enhancement.
-Some users may choose to keep the PVCs of different replicas different.
-We should not block the Pod updates for them.
-
-If `volumeClaimSyncStrategy` is `Async`,
-we just ignore the PVCs that cannot be updated to be compatible with the new `volumeClaimTemplates`,
-as we do currently.
-Of course, we report this in the status of the StatefulSet.
-
-However, a workload may rely on some features provided by a specific PVC,
-So we should provide a way to coordinate the update.
-That's why we also need `LockStep`.
- The StatefulSet controller should also keep the current and updated revision of the `volumeClaimTemplates`,
so that a StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated.

### Risks and Mitigations
@@ -690,7 +659,7 @@
-If `volumeClaimUpdateStrategy` is `OnDelete` and `volumeClaimSyncStrategy` is `Async` (the default values),
+If `volumeClaimUpdateStrategy` is `OnDelete` (the default value),
the behavior of the StatefulSet controller is almost the same as before.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

@@ -1038,29 +1007,29 @@
[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplates`.
e.g., prevent decreasing the storage size, preventing expansion if the storage class does not support it.
However, this has several drawbacks:
-* Not reverting the `volumeClaimTemplates` when rolling back the StatefulSet is confusing.
-* The validation can be a barrier when recovering from a failed update.
-  If RecoverVolumeExpansionFailure feature gate is enabled, we can recover from failed expansion by decreasing the size.
-* The validation is racy, especially when recovering from failed expansion.
-  We still need to consider most abnormal cases even we do those validations.
-* This does not match the pattern of existing behaviors.
-  That is, the controller should take the expected state, retry as needed to reach that state.
-  For example, StatefulSet will not reject an invalid `serviceAccountName`.
+* If we disallow decreasing, we make the editing a one-way road.
+  If a user edits it and then finds it was a mistake, there is no way back.
+  The StatefulSet will be broken forever. If this happens, the updates to pods will also be blocked. This is not acceptable.
+* To mitigate the above issue, we will want to prevent the user from going down this one-way road by mistake.
+  We are forced to do far more validation in the API server, which is very complex and fragile (please see KEP-0661).
+  For example: check storage class allowVolumeExpansion, check each PVC's storage class and size,
+  basically duplicate all the validations we have done to PVC.
+  And even if we do all the validations, there are still race conditions and async failures that are impossible for us to catch.
+  I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
+* Validation means we would have to disable rollback of the storage size. Enabling it later could surprise users, if it is not outright a breaking change.
+* The validation conflicts with the RecoverVolumeExpansionFailure feature, although it is still alpha.
* `volumeClaimTemplates` is also used when creating new PVCs, so even if the existing PVCs cannot be updated,
a user may still want to affect new PVCs.
+* It violates the high-level design.
+  The template describes a desired final state, rather than an immediate instruction.
+  A lot of things can happen externally after we update the template.
+  For example, suppose I have an IaaS platform which tries to kubectl apply one updated StatefulSet + one new StorageClass to the cluster to trigger the expansion of PVs.
+ We don't want to reject it just because the StorageClass is applied after the StatefulSet.

-### Only support for updating storage size
+### Support for updating arbitrary fields in `volumeClaimTemplates`

-[KEP-0661] only enables expanding the volume by updating `volumeClaimTemplates[*].spec.resources.requests.storage`.
-However,
-1. because the StatefulSet can take pre-existing PVCs,
-we still need to consider what to do when the template and PVC don't match.
-The complexity of this proposal will not decrease much if we only support expanding the volume.
-By enabling arbitrary updating to the `volumeClaimTemplates`,
-we just acknowledge and officially support this use case.
-1. We now have VolumeAttributesClass (VAC), which is expected to go to beta soon,
-and which can be patched on an existing PVC. We should also support patching the VAC
-by updating `volumeClaimTemplates`.
+There are no technical limitations; we just want to be careful and keep the changes small, so that we can move faster.
+This is just an extra validation in the API server. We may remove it later if we find it is not needed.

### Patch PVCs regardless of the immutable fields

+### Support for automatically skip not managed PVCs
+
+Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`.
+If it is set to `Async`, then we skip patching the PVCs that are not managed by the StatefulSet (e.g. the StorageClass does not match).
+
+The rules to determine which PVCs are managed are a little bit tricky.
+We have to check each field, and determine what to do for each field.
+
+And still, we want to keep the changes small.
+
[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412

## Infrastructure Needed (Optional)

From d35c83af2f65878851f13a0518550ac4a6e61432 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Thu, 22 Aug 2024 13:58:10 +0800
Subject: [PATCH 16/30] update with the implementation
---
 .../README.md | 37 ++++++++++++++++---
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 7d69552f181..fb47a510470 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -107,8 +107,10 @@ tags, and then generate with `hack/update-toc.sh`.
- [Alternatives](#alternatives) - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates) - [Support for updating arbitrary fields in volumeClaimTemplates](#support-for-updating-arbitrary-fields-in-volumeclaimtemplates) - - [Patch PVCs regardless of the immutable fields](#patch-pvcs-regardless-of-the-immutable-fields) -- [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs) + - [Patch PVC size regardless of the immutable fields](#patch-pvc-size-regardless-of-the-immutable-fields) + - [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs) + - [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels) + - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -395,7 +397,7 @@ This might be a good place to talk about core concepts and how they relate. When designing the `InPlace` update strategy, we update the PVC like how we re-create the Pod. i.e. we update the PVC whenever we would re-create the Pod; -we wait for the PVC to be compatible whenever we would wait for the Pod to be ready. +we wait for the PVC to be compatible whenever we would wait for the Pod to be available. The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplates`, so that a StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. @@ -429,6 +431,9 @@ We can use Server Side Apply to patch the PVCs, so that we will not interfere with the user's manual changes, e.g. to `metadata.labels` and `metadata.annotations`. +New invariants established about PVCs: +If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A. + ### Test Plan Will add unit tests for the StatefulSet controller with and without the feature gate, -`volumeClaimUpdateStrategy` set to `InPlace` and `OnDelete` respectively. +`volumeClaimUpdatePolicy` set to `InPlace` and `OnDelete` respectively. ### Rollout, Upgrade and Rollback Planning From 046c7710433c3051a7ccd9d6662d1e6fc38fb590 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Sun, 25 May 2025 01:00:08 +0800 Subject: [PATCH 18/30] merge template update --- .../README.md | 38 +++++++++++++++---- .../kep.yaml | 2 + 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index 3a12a95962c..49a18074a0e 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -484,21 +484,28 @@ extending the production code to implement this enhancement. ##### Integration tests -- : +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) ##### e2e tests @@ -506,13 +513,18 @@ https://storage.googleapis.com/k8s-triage/index.html This question should be filled when targeting a release. For Alpha, describe what tests will be added to ensure proper quality of the enhancement. 
-For Beta and GA, add links to added tests together with links to k8s-triage for those tests: -https://storage.googleapis.com/k8s-triage/index.html +For Beta and GA, document that tests have been written, +have been executed regularly, and have been stable. +This can be done with: +- permalinks to the GitHub source code +- links to the periodic job (typically a job owned by the SIG responsible for the feature), filtered by the test name +- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html) We expect no non-infra related flakes in the last month as a GA graduation criteria. +If e2e tests are not necessary or useful, explain why. --> -- : +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) ### Graduation Criteria @@ -553,13 +565,23 @@ Below are some examples to consider, in addition to the aforementioned [maturity - Gather feedback from developers and surveys - Complete features A, B, C - Additional tests are in Testgrid and linked in KEP +- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- All functionality completed +- All security enforcement completed +- All monitoring requirements completed +- All testing requirements completed +- All known pre-release issues and gaps resolved + +**Note:** Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified #### GA - N examples of real-world usage - N installs -- More rigorous forms of testing—e.g., downgrade tests and scalability tests - Allowing time for feedback +- All issues and gaps identified as feedback during beta are resolved + +**Note:** GA criteria must not include any functional, security, monitoring, or testing requirements. Those must be beta requirements. **Note:** Generally we also wait at least two releases between beta and GA/stable, because there's no opportunity for user feedback, or even bug reports, diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml index 89587d8f26f..20b702d4045 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -26,6 +26,8 @@ replaces: - "https://github.com/kubernetes/enhancements/pull/3412" # Previous attempt on 0611 # The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. 
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
From bad859d47f37e3b797c8d436364914b044550e1c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Sun, 25 May 2025 00:51:14 +0800
Subject: [PATCH 19/30] PVC compatible => ready
---
 .../README.md | 403 ++++++++++++------
 .../kep.yaml | 6 +-
 2 files changed, 268 insertions(+), 141 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 49a18074a0e..7f9bb65e358 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -78,12 +78,10 @@ tags, and then generate with `hack/update-toc.sh`.
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Kubernetes API Changes](#kubernetes-api-changes)
  - [Kubernetes Controller Changes](#kubernetes-controller-changes)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes)
    - [Story 2: Asymmetric Replicas](#story-2-asymmetric-replicas)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
@@ updates.
-->
Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently.
+This enhancement proposes relaxing validation of StatefulSet's VolumeClaim template.
+Specifically, we will allow modifying the following fields of `spec.volumeClaimTemplates`:
+* increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`)
+* modifying the VolumeAttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
+* modifying the VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
+* modifying the VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
+
+When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the
+PersistentVolumeClaims in the StatefulSet's pods.
+The behavior of updating PersistentVolumeClaim is similar to updating Pod.
+The updates to PersistentVolumeClaim will be coordinated with Pod updates to honor any dependencies between them.
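As a rough illustration of the relaxed validation described in the summary above, here is a minimal Go sketch. It assumes the allow-list is implemented by masking out the four mutable fields and comparing the rest; the helper name and its placement are hypothetical, not the actual kube-apiserver code.

```go
package validation

import (
	corev1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// claimTemplateUpdateAllowed reports whether an update from oldTmpl to
// newTmpl only touches the fields this KEP allows to change.
func claimTemplateUpdateAllowed(oldTmpl, newTmpl *corev1.PersistentVolumeClaim) bool {
	oldQ := oldTmpl.Spec.Resources.Requests[corev1.ResourceStorage]
	newQ := newTmpl.Spec.Resources.Requests[corev1.ResourceStorage]
	if newQ.Cmp(oldQ) < 0 {
		return false // the requested size may only increase
	}
	mask := func(in *corev1.PersistentVolumeClaim) *corev1.PersistentVolumeClaim {
		out := in.DeepCopy()
		out.Labels = nil                                            // mutable
		out.Annotations = nil                                       // mutable
		out.Spec.VolumeAttributesClassName = nil                    // mutable
		delete(out.Spec.Resources.Requests, corev1.ResourceStorage) // checked above
		return out
	}
	// Everything else must be unchanged.
	return apiequality.Semantic.DeepEqual(mask(oldTmpl), mask(newTmpl))
}
```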
## Motivation
@@ -215?
-* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`.
+* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`, specifically:
+  * increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`)
+  * modifying the VolumeAttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
+  * modifying the VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
+  * modifying the VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
+* Automatically patch the existing PersistentVolumeClaim objects, without interrupting the running Pods.
+* Add `.spec.volumeClaimUpdatePolicy` allowing users to decide how the volume claim will be updated: in-place or on PVC deletion.

### Non-Goals

+* Support for volumes that only support offline expansion.

## Proposal

-### Kubernetes API Changes
+### Kubernetes API Changes

-Changes to StatefulSet `spec`:
+Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
+ * `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only)
+ * `spec.volumeClaimTemplates.spec.volumeAttributesClassName`
+ * `spec.volumeClaimTemplates.metadata.labels`
+ * `spec.volumeClaimTemplates.metadata.annotations`

Introduce a new field in StatefulSet `spec`: `volumeClaimUpdatePolicy` to
specify how to coordinate the update of PVCs and Pods. Possible values are:
- `OnClaimDelete`: the default value, only update the PVC when the old PVC is deleted.
- `InPlace`: patch the PVC in-place if possible. Also includes the `OnClaimDelete` behavior.

Additionally collect the status of managed PVCs, and show them in the StatefulSet status.

Some fields in the `status` are updated to reflect the status of the PVCs:
- readyReplicas: in addition to pods, also consider the PVCs' status.
- availableReplicas: total number of replicas of which both Pod and PVCs are ready
 for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas
-- overSized: the number of PVCs that are larger than the template. -- totalCapacity: the sum of `status.capacity` of all the PVCs. - -Some fields in the `status` are also updated to reflect the staus of the PVCs: -- readyReplicas: in addition to pods, also consider the PVCs status. A PVC is not ready if: - - `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating; +Some fields in the `status` are updated to reflect the status of the PVCs: +- readyReplicas: in addition to pods, also consider the PVCs status. - availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` - currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs. @@ -278,78 +272,44 @@ Some fields in the `status` are also updated to reflect the staus of the PVCs: With these changes, user can still use `kubectl rollout status` to monitor the update process, both for automated patching and for the PVCs that need manual intervention. -### Updated Reconciliation Logic - -How to update PVCs: -1. If `volumeClaimUpdatePolicy` is `InPlace`, - and if `volumeClaimTemplates` and actual PVC only differ in mutable fields - (`spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` currently), - patch the PVC to the extent possible. - - `spec.resources.requests.storage` is patched to max(template spec, PVC status) - - Do not decreasing the storage size below its current status. - Note that decrease the size in PVC spec can help recover from a failed expansion if - `RecoverVolumeExpansionFailure` feature gate is enabled. - - `spec.volumeAttributesClassName` is patched to the template value. - - `metadata.labels` and `metadata.annotations` are patched with server side apply. - -2. If it is not possible to make the PVC [compatible](#what-pvc-is-compatible), - do nothing. But when recreating a Pod and the corresponding PVC is deleting, - wait for the deletion then create a new PVC together with the new Pod (already implemented). - +A PVC is considered ready if: +* PVC's `status.capacity.storage` is greater than or equal to min(template spec, PVC spec). + If the template is 10Gi, PVC is 10Gi and is expanding to 100Gi but failed, we still consider it ready. +* PVC's `status.currentVolumeAttributesClassName` equals to `spec.volumeAttributesClassName`. -3. Use either current or updated revision of the `volumeClaimTemplates` to create/update the PVC, - just like Pod template. - -When to update PVCs: -1. before advancing `status.updatedReplicas` to the next replica, - check that the PVCs of the next replica are - [compatible](#what-pvc-is-compatible) with the new `volumeClaimTemplates`. - If not, and if we are not going to patch it automatically, - wait for the user to delete/update the old PVC manually. - -2. When doing rolling update, A replica is considered ready if the Pod is ready - and all its volumes are not being updated in-place. - Wait for a replica to be ready for at least `minReadySeconds` before proceeding to the next replica. - -3. Whenever we check for Pod update, also check for PVCs update. - e.g.: - - If `spec.updateStrategy.type` is `RollingUpdate`, - update the PVCs in the order from the largest ordinal to the smallest. - - If `spec.updateStrategy.type` is `OnDelete`, - Only update the PVC when the Pod is deleted. - -4. When patching the PVC, if we also re-create the Pod, - update the PVC after old Pod deleted, together with creating new pod. 
- Otherwise, if pod is not changed, update the PVC only. +A new label `controller-revision-hash` is added to the PVCs, +to ensure we have the correct version of PVC in cache when determining whether the PVC is ready. -Failure cases: don't left too many PVCs being updated in-place. We expect to update the PVCs in order. +### Kubernetes Controller Changes -- If the PVC update fails, we should block the update process. - If the Pod is also deleted (by controller or manually), don't block the creation of new Pod. - We should retry and report events for this. - The events and status should look like those when the Pod creation fails. +Additionally watch for events from PVCs, in order to kickoff the update process when the PVC becomes ready. -- While waiting for the PVC to reach the compatible state, - We should update status, just like what we do when waiting for Pod to be ready. - We should block the update process if the PVC is never compatible. +If `volumeClaimUpdatePolicy` is `OnClaimDelete`, nothing changes. This field acts like a per-StatefulSet feature-gate. +The changes described below applies only for `InPlace` policy. -- If the `volumeClaimTemplates` is updated again when the previous rollout is blocked, - similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), - user may need to manually deal with the blocking PVCs (update or delete them). +Include `volumeClaimTemplates` in the `ControllerRevision`. + +Since modifying `volumeClaimTemplates` will change the hash, +Add support for updating `controller-revision-hash` label of the Pod without deleting and recreating the Pod, +if the pod template is not changed. + +Before creating a new Pod, or, if the Pod template is not changed, updating the label, +use server-side apply to update the PVCs used by the Pod. + +The patch used in server-side apply is the volumeClaimTemplates in the StatefulSet, except: +* `spec.resources.requests.storage` is set to max(template `spec.resources.requests.storage`, PVC `status.capacity.storage`), + so that we will not decrease the storage size below its current status. + Note that we still may decrease the size in PVC spec, + which can help recover from a failed expansion if `RecoverVolumeExpansionFailure` feature gate is enabled. +* `controller-revision-hash` label is added to the PVCs. +Naturally, most of the update control logics also apply to PVCs. +* Wait for PVCs to be ready for at least `minReadySeconds` before proceeding to the next replica. +* If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. +* If `updateStrategy` is `OnDelete`, only update the PVCs if the Pod is deleted manually. -### What PVC is compatible +When creating new PVCs, use the `volumeClaimTemplates` from the same revision that is used to create the Pod. -A PVC is compatible with the template if: -- All the immutable fields match exactly; and -- `metadata.labels` and `metadata.annotations` of PVC is a superset of the template; and -- `status.capacity.storage` of PVC is greater than or equal to - the `spec.resources.requests.storage` of the template; and -- `status.currentVolumeAttributesClassName` of PVC is equal to - the `spec.volumeAttributesClassName` of the template. ### User Stories (Optional) @@ -367,6 +327,7 @@ To expand the volumes managed by a StatefulSet, we can just use the same pipeline that we are already using to update the Pod. All the test, review, approval, and rollback process can be reused. 
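A hedged sketch of what Story 1 looks like with client-go once this KEP is in place (the namespace `default` and the names `web` and `data` are hypothetical; before this KEP, the Update call below is rejected by the API server):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	sts, err := cs.AppsV1().StatefulSets("default").Get(ctx, "web", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Bump the storage request of the "data" claim template; the controller
	// then rolls the change out to every replica's PVC, replica by replica.
	for i := range sts.Spec.VolumeClaimTemplates {
		if sts.Spec.VolumeClaimTemplates[i].Name == "data" {
			sts.Spec.VolumeClaimTemplates[i].Spec.Resources.Requests[corev1.ResourceStorage] =
				resource.MustParse("200Gi")
		}
	}
	if _, err := cs.AppsV1().StatefulSets("default").Update(ctx, sts, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```

This reuses whatever review and rollout tooling already wraps StatefulSet updates, which is the point of the story.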
+ -#### Story 3: Asymmetric Replicas +#### Story 2: Asymmetric Replicas The storage requirement of different replicas are not identical, so we still want to update each PVC manually and separately. @@ -395,12 +356,19 @@ Go in to as much detail as necessary here. This might be a good place to talk about core concepts and how they relate. --> -When designing the `InPlace` update strategy, we update the PVC like how we re-create the Pod. -i.e. we update the PVC whenever we would re-create the Pod; -we wait for the PVC to be compatible whenever we would wait for the Pod to be available. +When designing the `InPlace` update strategy, we want to reuse the infrastructures controlling Pod rollout. +We apply the changes to the PVCs before we set new `controller-revision-hash` label. +New invariance established about PVCs: +If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A. + +We introduce `controller-revision-hash` label on PVCs to: +* Record where have progressed, to ensure each PVC is only updated once per rollout. +* When waiting for PVCs to become ready, we can check the label to ensure we got the correct version in the informer cache. -The StatefulSet controller should also keeps the current and updated revision of the `volumeClaimTemplates`, -so that a StatefulSet can still re-create Pods and PVCs that are yet-to-be-updated. +The rational of using server-side apply to update PVCs: +Avoid interference with other controllers or human operators that operate on PVCs. +* If additional annotations/labels are added to the PVCs by others, do not remove them. +* If storage class is not set in the template, We should not care the storage class of the PVCs. ### Risks and Mitigations @@ -415,8 +383,29 @@ How will UX be reviewed, and by whom? Consider including folks who also work outside the SIG or subproject. --> -TODO: Recover from failed in-place update (insufficient storage, etc.) -What else is needed in addition to revert the StatefulSet spec? + + +Since we don't allow decreasing the storage size of `volumeClaimTemplates`, +it is not possible to run `kubectl rollout undo` after increasing it. +We may loose this restriction in the future. +But unfortunately, since volume expansion cannot be fully cancelled, +undoing StatefulSet changes may not be enough to revert the system to the previous state, +but should be enough to unblock StatefulSet rollout. + +The user who can update the StatefulSet gains implicit permission to update the PVCs. +This can incur extra fee to cloud providers. +Cluster administrators should setup appropriate quota or validation to mitigate this. + +Interfering with other controllers or human operators. +Over the years, the user may have deployed third-party controllers to e.g., expand the volume automatically. +We should not interfere with them. Like Pods, we use `controller-revision-hash` label to record whether we have updated the PVCs. +If the `controller-revision-hash` label on either Pod or PVC is already matched, we will not touch the PVCs again. +So we will not interfere with them as long as the `controller-revision-hash` label is preserved by them. + +New Pod may still see old PVC configuration. +We already ensure that the PVC is updated before the new Pod is created. +However, the operation on PVCs can be asynchronous. And expansion may not finish without a running Pod. + ## Design Details @@ -427,12 +416,56 @@ required) or even code snippets. 
If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> -We can use Server Side Apply to patch the PVCs, -so that we will not interfere with the user's manual changes, -e.g. to `metadata.labels` and `metadata.annotations`. +When updating volumeClaimTemplates along with pod template, we will go through the following steps: +1. Delete the old pod. +2. Apply the changes to the PVCs used by this replica. +3. Create the new pod with new `controller-revision-hash` label. +4. Wait for the new pod and PVCs to be ready. +5. Advance to the next replica and repeat from step 1. + +When only updating the volumeClaimTemplates: +1. Apply the changes to the PVCs used by this replica. +2. Update the pod with new `controller-revision-hash` label. +3. Wait for the PVCs to be ready. +4. Advance to the next replica and repeat from step 1. + +Assuming we are updating a replica from revision A to revision B: + +| Pod | PVC | Action | +| --- | --- | --- | +| - | not existing | create PVC at revision B | +| not existing | at revision A | update PVC to revision B | +| not existing | at revision B | create Pod at revision B | +| at revision A | at revision A | update PVC to revision B | +| at revision A | at revision B | delete Pod or update Pod label | +| at revision B | existing | wait for Pod/PVC to be ready | + +Note that when Pod is at revision B but PVC is at revision A, we will not update PVC. +Such state can only happen when user set `volumeClaimUpdatePolicy` to `InPlace` when the feature-gate of KCM is disabled, +or disable the previously enabled feature-gate. +We require user to initiate another rollout to update the PVCs, to avoid any surprise. + +Failure cases: don't left too many PVCs being updated in-place. We expect to update the PVCs in order. + +- If the PVC update fails, we should block the StatefulSet rollout process. + This will also block the creation of new Pod. + We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod. + If this still happens (e.g., because of webhook), We should retry and report events for this. + The events and status should look like those when the Pod creation fails. + +- While waiting for the PVC to become ready, + We should update status, just like what we do when waiting for Pod to be ready. + We should block the StatefulSet rollout process if the PVC is never ready. + +- When individual PVC failed to become ready, the user can update that PVC manually to bring it back to ready. + +- If the `volumeClaimTemplates` is updated again when the previous rollout is blocked, + similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), + user may need to manually deal with the blocking PVCs (update or delete them). + +In all cases, if the user determines the failure of updating PVCs is not critical, +he can change `volumeClaimUpdatePolicy` back to `OnClaimDelete` to unblock normal Pod rollout. -New invariants established about PVCs: -If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A. ### Test Plan @@ -447,7 +480,7 @@ when drafting this test plan. 
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md --> -[ ] I/we understand the owners of the involved components may require updates to +[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. @@ -479,7 +512,10 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement. --> -- ``: `` - `` +For alpha, the core package we will be touching: +- `pkg/controller/statefulset`: `2025-05-25` - `86.5%` +- `pkg/controller/history`: `2025-05-25` - `84.5` +- `pkg/apis/apps/validation`: `2025-05-25` - `92.5%` ##### Integration tests @@ -507,6 +543,12 @@ This can be done with: - [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) +- When the feature gate is enabled, existing StatefulSets gains a default `volumeClaimUpdatePolicy` of `OnClaimDelete`, and can be updated to `InPlace`. + Then disable the feature gate, `volumeClaimUpdatePolicy` field should remain unchanged, but user can clear it manually. + +- When the feature gate is disabled in the mid of the PVC rollout, we should not update or wait for the PVCs anymore. + `volumeClaimTemplate` should remains in the controllerRevision. And the current rollout should finish successfully. + ##### e2e tests +#### Alpha + +- Feature implemented behind a feature flag +- Initial e2e tests completed and enabled + +#### Beta + +- Gather feedback from developers and surveys +- Complete features A, B, C +- Additional tests are in Testgrid and linked in KEP +- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- All functionality completed +- All security enforcement completed +- All monitoring requirements completed +- All testing requirements completed +- All known pre-release issues and gaps resolved + +**Note:** Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified + +#### GA + +- N examples of real-world usage +- N installs +- Allowing time for feedback +- All issues and gaps identified as feedback during beta are resolved + + ### Upgrade / Downgrade Strategy +No changes required to maintain previous behavior. + +To make use of the enhancement, user can update `volumeClaimTemplates` of existing StatefulSets. +He can also update `volumeClaimUpdatePolicy` to `InPlace` in order to rollout the changes automatically. + ### Version Skew Strategy +No coordinating behavior in the control plane and nodes. + +Should enable this feature for APIServer before kube-controller-manager. +An n-1 kube-controller-manager should ignore the `volumeClaimUpdatePolicy` field and never touch PVCs. +It should always create PVCs with the latest `volumeClaimTemplates`. + +If `volumeClaimUpdatePolicy` is set to `InPlace`, +when new kube-controller-manager starts, it should pick this up and start rolling out PVCs immediately. + +If `volumeClaimUpdatePolicy` is set to `InPlace` when the feature-gate of kube-controller-manager is disabled, +kube-controller-manager should still update the controllerRevision and label on Pods. 
+After that, when the feature-gate of kube-controller-manager is enabled, +user needs to update the `volumeClaimTemplates` again to trigger another rollout. + ## Production Readiness Review Questionnaire The update to StatefulSet `volumeClaimTemplates` will be accepted by the API server while it is previously rejected. +StatefulSets gains a new field `volumeClaimUpdatePolicy` with default value `OnClaimDelete`. Otherwise No. -If `volumeClaimUpdatePolicy` is `OnDelete` (the default values), +If `volumeClaimUpdatePolicy` is `OnClaimDelete` (the default values), the behavior of StatefulSet controller is almost the same as before. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? @@ -704,13 +796,14 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs now, disable this feature gate should not leave any inconsistent state. -If the `volumeClaimTemplates` is updated then the feature is disabled and the StatefulSet is rolled back, -The `volumeClaimTemplates` will be kept as the latest version, and the history of them will be lost. +The `volumeClaimUpdatePolicy` field will not be cleared automatically. +When it is set to `InPlace`, `volumeClaimTemplates` also remains in the controllerRevision. +User can rollback each StatefulSet manually by deleting the `volumeClaimUpdatePolicy` field. ###### What happens if we reenable the feature if it was previously rolled back? -If the `volumeClaimUpdatePolicy` is already set to `InPlace` reenable the feature -will kick off the update process immediately. +If the `volumeClaimUpdatePolicy` is already set to `InPlace`, +user needs to update the `volumeClaimTemplates` again to trigger a rollout. ###### Are there any tests for feature enablement/disablement? @@ -727,7 +820,9 @@ You can take a look at one potential example of such test in: https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 --> Will add unit tests for the StatefulSet controller with and without the feature gate, -`volumeClaimUpdatePolicy` set to `InPlace` and `OnDelete` respectively. +`volumeClaimUpdatePolicy` set to `InPlace` and `OnClaimDelete` respectively. + +Will add unit tests for exercising the switch of feature gate when `volumeClaimUpdatePolicy` already set. ### Rollout, Upgrade and Rollback Planning @@ -863,6 +958,9 @@ and creating new ones, as well as about cluster-level services (e.g. DNS): - Impact of its outage on the feature: - Impact of its degraded performance or high-error rates on the feature: --> +CSI drivers with in-place ExpandVolume or ModifyVolume capabilities, +when `spec.resources.requests.storage` or `spec.volumeAttributesClassName` of `volumeClaimTemplates` is updated respectively. 
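+For example, modifying `spec.volumeAttributesClassName` in the template only takes effect if the referenced class targets a CSI driver with the ModifyVolume capability. A hedged sketch (the driver and parameters are hypothetical, and the VolumeAttributesClass API is beta and disabled by default):
+
+```yaml
+apiVersion: storage.k8s.io/v1beta1
+kind: VolumeAttributesClass
+metadata:
+  name: fast-iops                # hypothetical class name
+driverName: ebs.csi.aws.com      # any CSI driver advertising ModifyVolume
+parameters:
+  iops: "16000"                  # driver-specific parameters
+```
+
+A `volumeClaimTemplates` entry would then reference it via `spec.volumeAttributesClassName: fast-iops`, and the rollout depends on the external-resizer reconciling each PVC.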
+
 
 ### Scalability
 
@@ -892,14 +990,14 @@ Focusing mostly on:
 -->
 - PATCH StatefulSet
   - kubectl or other user agents
-- PATCH PersistentVolumeClaim
-  - 1 per updated PVC in the StatefulSet (number of updated claim template * replica)
+- PATCH PersistentVolumeClaim (server-side apply)
+  - 1 per PVC in the StatefulSet (number of updated claim template * replica)
   - StatefulSet controller (in KCM)
   - triggered by the StatefulSet spec update
-- PATCH StatefulSet status
-  - 1-2 per updated PVC in the StatefulSet (number of updated claim template * replica)
-  - StatefulSet controller (in KCM)
-  - triggered by the StatefulSet spec update and PVC status update
+
+StatefulSet controller will watch PVC updates.
+(although the StatefulSet controller did not watch PVCs before, KCM does)
+
 
 ###### Will enabling / using this feature result in introducing new API types?
 
@@ -918,7 +1016,7 @@ Describe them, providing:
 - Which API(s):
 - Estimated increase:
 -->
-Not directly. The cloud provider may be called when the PVCs are updated.
+Not directly. The cloud provider may be called when the PVCs are updated, by CSI.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
@@ -929,8 +1027,9 @@ Describe them, providing:
 - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->
 StatefulSet:
-- `spec`: 2 new enum fields, ~10B
-- `status`: 4 new integer fields, ~10B
+- `spec`: 1 new enum field, ~10B
+PersistentVolumeClaim:
+- new label `controller-revision-hash` of size 32B
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -986,6 +1085,11 @@ details). For now, we leave it here.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
+Not very different from the current StatefulSet controller workflow.
+
+If the API server and/or etcd is unavailable, we either cannot apply the update to PVCs, or cannot gather status of PVCs.
+In both cases, the rollout will be blocked until the API server and/or etcd is available again.
+
 ###### What are other known failure modes?
 
+- Rollout of the StatefulSet blocked due to failing to update PVCs
+  - Detection: apiserver_request_total{resource="persistentvolumeclaims",verb="patch",code!="200"} increased. Events on StatefulSet.
+  - Mitigations:
+    - Undo `volumeClaimTemplates` changes
+    - Set `volumeClaimUpdatePolicy` to `OnClaimDelete`
+  - Diagnostics: Events on StatefulSet
+  - Testing: Will test the Event is emitted
+
+- Rollout of the StatefulSet blocked due to PVCs never becomes ready, expansion or modify volume failed
+  - Detection: Events on PVC. controller_{modify,expand}_volume_errors_total metrics on external-resizer
+  - Mitigations:
+    - Undo `volumeClaimTemplates` changes
+    - Set `volumeClaimUpdatePolicy` to `OnClaimDelete`
+    - Edit PVC manually to correct the issue
+  - Diagnostics: Events on PVC, logs of external-resizer
+  - Testing: No. the error is already reported on the PVC, by external-resizer.
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+When SLOs are not being met, events on the PVC or StatefulSet are emitted.
+If the problem is not determined from events, the operator should check whether the PVC spec is updated correctly.
+If so, follow the troubleshooting instructions for expanding or modifying volumes.
+If not, look into the KCM log to determine why the PVC is not updated, raising the log level if necessary.
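+To make the detection above concrete, one could alert on failed PVC patches. A hedged sketch, assuming the Prometheus Operator's `PrometheusRule` CRD is installed (the rule name and threshold are illustrative, and exact metric labels may vary by cluster):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: statefulset-pvc-rollout          # hypothetical rule name
+spec:
+  groups:
+  - name: statefulset-pvc-rollout
+    rules:
+    - alert: StatefulSetPVCPatchFailing
+      # Mirrors the detection signal described above for blocked rollouts.
+      expr: rate(apiserver_request_total{resource="persistentvolumeclaims",verb="patch",code!="200"}[5m]) > 0
+      for: 10m
+      annotations:
+        summary: PVC patches are failing; a StatefulSet rollout may be blocked.
+```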
+ ## Implementation History * Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`, specifically: * increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`) - * modifying Volume AttributesClass used by the claim(`spec.volumeClaimTemplates.spec.volumeAttributesClassName`) - * modifying VolumeClaim template's labels(`spec.volumeClaimTemplates.metadata.labels`) - * modifying VolumeClaim template's annotations(`spec.volumeClaimTemplates.metadata.annotations`) -* Automatically patch the existing PersistentVolumeClaim objects, without interrupting the running Pods. + * modifying Volume AttributesClass used by the claim( `spec.volumeClaimTemplates.spec.volumeAttributesClassName`) + * modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`) + * modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`) * Add `.spec.volumeClaimUpdatePolicy` allowing users to decide how the volume claim will be updated: in-place or on PVC deletion. @@ -264,8 +263,7 @@ specify how to coordinate the update of PVCs and Pods. Possible values are: Additionally collect the status of managed PVCs, and show them in the StatefulSet status. Some fields in the `status` are updated to reflect the status of the PVCs: -- readyReplicas: in addition to pods, also consider the PVCs status. -- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds` +- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use. - currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs. @@ -303,7 +301,7 @@ The patch used in server-side apply is the volumeClaimTemplates in the StatefulS which can help recover from a failed expansion if `RecoverVolumeExpansionFailure` feature gate is enabled. * `controller-revision-hash` label is added to the PVCs. -Naturally, most of the update control logics also apply to PVCs. +Naturally, most of the update control logic also applies to PVCs. * Wait for PVCs to be ready for at least `minReadySeconds` before proceeding to the next replica. * If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. * If `updateStrategy` is `OnDelete`, only update the PVCs if the Pod is deleted manually. @@ -416,6 +414,10 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> +When `volumeClaimUpdatePolicy` is `OnClaimDelete`, APIServer should accept the changes to `volumeClaimTemplates`, +but StatefulSet controller should not touch the PVCs and preserve the current behaviour. +Following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`. + When updating volumeClaimTemplates along with pod template, we will go through the following steps: 1. Delete the old pod. 2. Apply the changes to the PVCs used by this replica. @@ -648,14 +650,14 @@ in back-to-back releases. #### Alpha - Feature implemented behind a feature flag -- Initial e2e tests completed and enabled +- Initial unit, integration and e2e tests completed #### Beta - Gather feedback from developers and surveys -- Complete features A, B, C +- Complete features: StatefulSet status reporting and `kubectl rollout status` support. 
- Additional tests are in Testgrid and linked in KEP -- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- Downgrade tests and scalability tests - All functionality completed - All security enforcement completed - All monitoring requirements completed @@ -666,8 +668,7 @@ in back-to-back releases. #### GA -- N examples of real-world usage -- N installs +- 3 examples of real-world usage - Allowing time for feedback - All issues and gaps identified as feedback during beta are resolved @@ -689,7 +690,7 @@ enhancement: No changes required to maintain previous behavior. To make use of the enhancement, user can update `volumeClaimTemplates` of existing StatefulSets. -He can also update `volumeClaimUpdatePolicy` to `InPlace` in order to rollout the changes automatically. +One can also update `volumeClaimUpdatePolicy` to `InPlace` in order to rollout the changes automatically. ### Version Skew Strategy @@ -706,13 +707,13 @@ enhancement: CRI or CNI may require updating that component before the kubelet. --> -No coordinating behavior in the control plane and nodes. +No coordinating between the control plane and nodes are required, since this KEP does not involve nodes. Should enable this feature for APIServer before kube-controller-manager. An n-1 kube-controller-manager should ignore the `volumeClaimUpdatePolicy` field and never touch PVCs. It should always create PVCs with the latest `volumeClaimTemplates`. -If `volumeClaimUpdatePolicy` is set to `InPlace`, +If `volumeClaimUpdatePolicy` is set to `InPlace` while the kube-controller-manager is down, when new kube-controller-manager starts, it should pick this up and start rolling out PVCs immediately. If `volumeClaimUpdatePolicy` is set to `InPlace` when the feature-gate of kube-controller-manager is disabled, @@ -1142,6 +1143,8 @@ Major milestones might include: - the version of Kubernetes where the KEP graduated to general availability - when the KEP was retired or superseded --> +- 2024-05-17: initial version +- 2025-06-09: targeting v1.34 for alpha ## Drawbacks diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml index a59224c47fa..ef38ccf1298 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -6,7 +6,7 @@ authors: owning-sig: sig-apps participating-sigs: - sig-storage -status: provisional +status: implementable creation-date: 2024-05-17 reviewers: - "@kow3ns" From 69039de3eb066b73ed54b8e77a8c1a046eb2892d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Wed, 11 Jun 2025 00:07:45 +0800 Subject: [PATCH 21/30] Not consider `minReadySeconds` for PVCs only update --- .../4650-stateful-set-update-claim-template/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index f6b7c2f3de3..25059674299 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -302,9 +302,11 @@ The patch used in server-side apply is the volumeClaimTemplates in the StatefulS * `controller-revision-hash` label is added to the PVCs. Naturally, most of the update control logic also applies to PVCs. -* Wait for PVCs to be ready for at least `minReadySeconds` before proceeding to the next replica. 
 * If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest.
 * If `updateStrategy` is `OnDelete`, only update the PVCs if the Pod is deleted manually.
+However, `minReadySeconds` is not considered when only PVCs are updated,
+because it is hard to determine when the PVC becomes ready.
+And updating PVCs is unlikely to disrupt workloads, so it should be unnecessary to inject delay into the update process.
 
 When creating new PVCs, use the `volumeClaimTemplates` from the same revision that is used to create the Pod.
 
From af712006c34d588da6c91e9822bd06f96f36eb9f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?=
Date: Wed, 11 Jun 2025 00:17:06 +0800
Subject: [PATCH 22/30] not integrating RecoverVolumeExpansionFailure

---
 .../README.md | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
index 25059674299..c1da49d4339 100644
--- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
+++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md
@@ -112,6 +112,7 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs)
   - [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
   - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
+  - [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature)
   - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 
@@ -295,10 +296,8 @@ Before creating a new Pod, or, if the Pod template is not changed, updating the
 use server-side apply to update the PVCs used by the Pod.
 
 The patch used in server-side apply is the volumeClaimTemplates in the StatefulSet, except:
-* `spec.resources.requests.storage` is set to max(template `spec.resources.requests.storage`, PVC `status.capacity.storage`),
-  so that we will not decrease the storage size below its current status.
-  Note that we still may decrease the size in PVC spec,
-  which can help recover from a failed expansion if `RecoverVolumeExpansionFailure` feature gate is enabled.
+* `spec.resources.requests.storage` is set to max(template `spec.resources.requests.storage`, PVC `spec.resources.requests.storage`),
+  so that we will never decrease the storage size.
 * `controller-revision-hash` label is added to the PVCs.
 
 Naturally, most of the update control logic also applies to PVCs.
@@ -1232,6 +1231,16 @@ and all operations are blocked.
 
 [KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412
 
+### Integrate with RecoverVolumeExpansionFailure feature
+
+We may decrease the size in PVC spec automatically to help recover from a failed expansion
+if `RecoverVolumeExpansionFailure` feature gate is enabled.
+However, when reducing the spec size of a PVC, it must still be greater than its status size (not equal to).
+So we don't know what to set if the `volumeClaimTemplates` size is smaller than the PVC status size.
+
+The user can still update the PVC manually.
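+For instance, a hedged sketch of such a manual recovery (names and sizes are hypothetical, and it assumes the `RecoverVolumeExpansionFailure` feature gate is enabled): if `data-web-0` was asked to expand to 100Gi, the expansion keeps failing, and `status.capacity.storage` is still 10Gi, the user may lower the request themselves:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: data-web-0        # hypothetical claim of replica 0
+spec:
+  resources:
+    requests:
+      storage: 20Gi       # lowered from 100Gi; must stay strictly above status.capacity (10Gi)
+```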
+
 
 ## Infrastructure Needed (Optional)
 
@@ -186,6 +187,7 @@ Specifically, we will allow modifying the following fields of `spec.volumeClaimT
 * modifying Volume AttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
 * modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
 * modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
+
 When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the PersistentVolumeClaims
 in the StatefulSet's pods.
 The behavior of updating PersistentVolumeClaim is similar to updating Pod.
@@ -264,7 +266,6 @@ specify how to coordinate the update of PVCs and Pods. Possible values are:
 
 Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
 Some fields in the `status` are updated to reflect the status of the PVCs:
-- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
 - currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs.
 
@@ -358,7 +359,7 @@ This might be a good place to talk about core concepts and how they relate.
 
 When designing the `InPlace` update strategy, we want to reuse the infrastructure controlling Pod rollout.
 We apply the changes to the PVCs before we set the new `controller-revision-hash` label.
 New invariant established about PVCs:
-If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A.
+If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A and ready.
 
 We introduce `controller-revision-hash` label on PVCs to:
 * Record where we have progressed, to ensure each PVC is only updated once per rollout.
@@ -420,28 +421,31 @@ but StatefulSet controller should not touch the PVCs and preserve the current be
 Following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`.
 
 When updating volumeClaimTemplates along with pod template, we will go through the following steps:
-1. Delete the old pod.
-2. Apply the changes to the PVCs used by this replica.
-3. Create the new pod with new `controller-revision-hash` label.
-4. Wait for the new pod and PVCs to be ready.
-5. Advance to the next replica and repeat from step 1.
+1. Apply the changes to the PVCs used by this replica.
+2. Wait for the PVCs to be ready.
+3. Delete the old pod.
+4. Create the new pod with new `controller-revision-hash` label.
+5. Wait for the new pod to be ready.
+6. Advance to the next replica and repeat from step 1.
 
 When only updating the volumeClaimTemplates:
 1. Apply the changes to the PVCs used by this replica.
-2. Update the pod with new `controller-revision-hash` label.
-3. Wait for the PVCs to be ready.
+2. Wait for the PVCs to be ready.
+3. Update the pod with new `controller-revision-hash` label.
 4. Advance to the next replica and repeat from step 1.
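+A hedged sketch of the server-side apply body used in step 1 above (all names, hashes, and values are hypothetical; the real patch is derived from the updated `volumeClaimTemplates` as described earlier):
+
+```yaml
+# Applied with a controller-owned field manager; fields not listed here
+# (e.g. labels added by other controllers) are left untouched.
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: data-web-1                            # <template name>-<statefulset>-<ordinal>
+  labels:
+    controller-revision-hash: web-7d58f9b6c   # revision B (hypothetical hash)
+spec:
+  volumeAttributesClassName: fast-iops        # copied from the updated template
+  resources:
+    requests:
+      storage: 20Gi                           # max(template request, current PVC value)
+```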
Assuming we are updating a replica from revision A to revision B: | Pod | PVC | Action | | --- | --- | --- | -| - | not existing | create PVC at revision B | +| not existing | not existing | create PVC at revision B | | not existing | at revision A | update PVC to revision B | | not existing | at revision B | create Pod at revision B | +| at revision A | not existing | create PVC at revision B | | at revision A | at revision A | update PVC to revision B | -| at revision A | at revision B | delete Pod or update Pod label | -| at revision B | existing | wait for Pod/PVC to be ready | +| at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label | +| at revision B | not existing | create PVC at revision B | +| at revision B | existing | wait for Pod to be ready | Note that when Pod is at revision B but PVC is at revision A, we will not update PVC. Such state can only happen when user set `volumeClaimUpdatePolicy` to `InPlace` when the feature-gate of KCM is disabled, @@ -451,10 +455,16 @@ We require user to initiate another rollout to update the PVCs, to avoid any sur Failure cases: don't left too many PVCs being updated in-place. We expect to update the PVCs in order. - If the PVC update fails, we should block the StatefulSet rollout process. - This will also block the creation of new Pod. - We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod. - If this still happens (e.g., because of webhook), We should retry and report events for this. + We should retry and report events for this. The events and status should look like those when the Pod creation fails. + We update PVC before deleting the old Pod, so failure of PVC update should not disrupt running Pods, + and user should have time to fix this manually. + The failure cases of this kind includes (but not limited to): + - immutable fields mismatch (e.g. storageClassName) + - webhook + - [storage quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota) + - [VAC quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass) + - StorageClass.allowVolumeExpansion not set to true - While waiting for the PVC to become ready, We should update status, just like what we do when waiting for Pod to be ready. @@ -465,6 +475,8 @@ Failure cases: don't left too many PVCs being updated in-place. We expect to upd - If the `volumeClaimTemplates` is updated again when the previous rollout is blocked, similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), user may need to manually deal with the blocking PVCs (update or delete them). + - If the PVC cannot become ready because of the old Pod (e.g. unable to schedule), + user can delete the Pod and the StatefulSet controller will create a new Pod at new revision. In all cases, if the user determines the failure of updating PVCs is not critical, he can change `volumeClaimUpdatePolicy` back to `OnClaimDelete` to unblock normal Pod rollout. @@ -1240,6 +1252,21 @@ So we don't know what to set if `volumeClaimTemplates` is smaller than PVC statu User can still update PVC manually. +### Order of Pod / PVC updates + +We've considered delete the Pod while/before updating the PVC, but realized several issues: +* The admission of PVC update is fairly complex, it can fail for many reasons. + We want to make sure the Pod is still running if we cannot update the PVC. 
+* As described in [KEP-5381], we want to allow affinity change when the VolumeAttributesClass is updated. + Updating PVC and Pod concurrently may trigger a race condition where the Pod can be scheduled to wrong node. + +The current order (wait for PVC ready before delete old Pod) has an extra advantage: +When Pod is ready, it is guaranteed that the PVC is ready too. +So any existing tools to monitor StatefulSet rollout process does not need to change. + +This downside is that the concurrency is lower, so the rolling update may take longer. + +[KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional ## Infrastructure Needed (Optional) From 12167436efa6f8e0517377715091fd6a417f9d63 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Wed, 18 Jun 2025 16:16:26 +0800 Subject: [PATCH 24/30] describe volumeClaimUpdatePolicy update behaviour --- .../README.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index b6a8606571b..6c4ec0cabc3 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -452,6 +452,18 @@ Such state can only happen when user set `volumeClaimUpdatePolicy` to `InPlace` or disable the previously enabled feature-gate. We require user to initiate another rollout to update the PVCs, to avoid any surprise. +When `volumeClaimUpdatePolicy` is updated from `OnClaimDelete` to `InPlace`, +StatefulSet controller will begin to add claim templates to ControllerRevision, +which will change its hash and trigger an rollout. +The rollout works like a volumeClaimTemplates only rollout above. +In this case, step 3 will be no-op if PVC is not changed actually (apart from adding the new controller-revision-hash label), +so the rollout should proceed really fast. + +When `volumeClaimUpdatePolicy` is updated from `InPlace` to `OnClaimDelete`, +StatefulSet controller will begin to remove claim templates to ControllerRevision, +which will change its hash and trigger an rollout. +PVCs will not be touched and Pods will be updated with new `controller-revision-hash` label. + Failure cases: don't left too many PVCs being updated in-place. We expect to update the PVCs in order. - If the PVC update fails, we should block the StatefulSet rollout process. From a708d07386bd627ad5a43fd6d104b193ca0548a3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 20 Jun 2025 14:33:11 +0800 Subject: [PATCH 25/30] discuss more on kubectl rollout undo --- .../README.md | 37 +++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index 6c4ec0cabc3..de65bd5f76b 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -114,6 +114,7 @@ tags, and then generate with `hack/update-toc.sh`. 
 - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
   - [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature)
   - [Order of Pod / PVC updates](#order-of-pod--pvc-updates)
+  - [When to track volumeClaimTemplates in ControllerRevision](#when-to-track-volumeclaimtemplates-in-controllerrevision)
   - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 
@@ -387,6 +388,7 @@ Consider including folks who also work outside the SIG or subproject.
 
 Since we don't allow decreasing the storage size of `volumeClaimTemplates`,
 it is not possible to run `kubectl rollout undo` after increasing it.
+This may surprise users already working with StatefulSets, and may amount to a breaking change.
 We may loosen this restriction in the future.
 But unfortunately, since volume expansion cannot be fully cancelled,
 undoing StatefulSet changes may not be enough to revert the system to the previous state,
@@ -1280,6 +1282,41 @@ This downside is that the concurrency is lower, so the rolling update may take l
 
 [KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional
 
+### When to track `volumeClaimTemplates` in `ControllerRevision`
+
+The current design tracks volumeClaimTemplates in ControllerRevision only when `volumeClaimUpdatePolicy` is set to `InPlace`.
+
+There are two reasons:
+1. We want a new revision to trigger the rollout when `volumeClaimUpdatePolicy` is changed from `OnClaimDelete` to `InPlace`.
+2. We want to avoid updating all the Pods under any StatefulSet at once when the feature-gate is enabled, to avoid overloading the control-plane.
+
+If we track volumeClaimTemplates whenever the feature-gate is enabled, we violate all these reasons.
+
+Or we can make this tri-state:
+* empty/nil: the default and preserve the current behavior.
+* `OnClaimDelete`: Add volumeClaimTemplate to the history, but don't update PVCs
+* `InPlace`: Add volumeClaimTemplate to the history, and also update PVCs in-place
+
+While this resolves reason 2, it still violates reason 1.
+
+We can add volumeClaimUpdatePolicy to ControllerRevision to resolve reason 1.
+But all the policies we already have does not present in ControllerRevision. So this is not ideal either.
+
+The down-side of the current design is that `kubectl rollout undo` may not work as expected sometimes.
+
+* If `volumeClaimUpdatePolicy` is set to `OnClaimDelete`, `kubectl rollout undo` will not undo the `volumeClaimTemplates`.
+* When changing `volumeClaimUpdatePolicy` from `OnClaimDelete` to `InPlace` to trigger the rollout, `kubectl rollout undo` will be no-op.
+* Consider the following history:
+  1. Pod Rev1 + PVC Rev1 + `OnClaimDelete`
+  2. Pod Rev2 + PVC Rev1 + `InPlace`
+  3. Pod Rev2 + PVC Rev2 + `InPlace`
+
+  Now if the user reverts to history 1 directly, `volumeClaimTemplates` will not be reverted.
+  But if the user reverts to history 2, then history 1, `volumeClaimTemplates` will be reverted.
+
+While somewhat surprising, `kubectl rollout undo` is just a convenient method to update the StatefulSet.
+ ## Infrastructure Needed (Optional) -When `volumeClaimUpdatePolicy` is `OnClaimDelete`, APIServer should accept the changes to `volumeClaimTemplates`, +When `volumeClaimUpdateStrategy` is `OnClaimDelete`, APIServer should accept the changes to `volumeClaimTemplates`, but StatefulSet controller should not touch the PVCs and preserve the current behaviour. -Following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`. +Following describes the workflow when `volumeClaimUpdateStrategy` is `InPlace`. When updating volumeClaimTemplates along with pod template, we will go through the following steps: 1. Apply the changes to the PVCs used by this replica. @@ -450,18 +512,18 @@ Assuming we are updating a replica from revision A to revision B: | at revision B | existing | wait for Pod to be ready | Note that when Pod is at revision B but PVC is at revision A, we will not update PVC. -Such state can only happen when user set `volumeClaimUpdatePolicy` to `InPlace` when the feature-gate of KCM is disabled, +Such state can only happen when user set `volumeClaimUpdateStrategy` to `InPlace` when the feature-gate of KCM is disabled, or disable the previously enabled feature-gate. We require user to initiate another rollout to update the PVCs, to avoid any surprise. -When `volumeClaimUpdatePolicy` is updated from `OnClaimDelete` to `InPlace`, +When `volumeClaimUpdateStrategy` is updated from `OnClaimDelete` to `InPlace`, StatefulSet controller will begin to add claim templates to ControllerRevision, which will change its hash and trigger an rollout. The rollout works like a volumeClaimTemplates only rollout above. In this case, step 3 will be no-op if PVC is not changed actually (apart from adding the new controller-revision-hash label), so the rollout should proceed really fast. -When `volumeClaimUpdatePolicy` is updated from `InPlace` to `OnClaimDelete`, +When `volumeClaimUpdateStrategy` is updated from `InPlace` to `OnClaimDelete`, StatefulSet controller will begin to remove claim templates to ControllerRevision, which will change its hash and trigger an rollout. PVCs will not be touched and Pods will be updated with new `controller-revision-hash` label. @@ -493,7 +555,7 @@ Failure cases: don't left too many PVCs being updated in-place. We expect to upd user can delete the Pod and the StatefulSet controller will create a new Pod at new revision. In all cases, if the user determines the failure of updating PVCs is not critical, -he can change `volumeClaimUpdatePolicy` back to `OnClaimDelete` to unblock normal Pod rollout. +he can change `volumeClaimUpdateStrategy` back to `OnClaimDelete` to unblock normal Pod rollout. ### Test Plan @@ -572,8 +634,8 @@ This can be done with: - [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) -- When the feature gate is enabled, existing StatefulSets gains a default `volumeClaimUpdatePolicy` of `OnClaimDelete`, and can be updated to `InPlace`. - Then disable the feature gate, `volumeClaimUpdatePolicy` field should remain unchanged, but user can clear it manually. +- When the feature gate is enabled, existing StatefulSets gains a default `volumeClaimUpdateStrategy` of `OnClaimDelete`, and can be updated to `InPlace`. 
+ Then disable the feature gate, `volumeClaimUpdateStrategy` field should remain unchanged, but user can clear it manually. - When the feature gate is disabled in the mid of the PVC rollout, we should not update or wait for the PVCs anymore. `volumeClaimTemplate` should remains in the controllerRevision. And the current rollout should finish successfully. @@ -597,7 +659,7 @@ If e2e tests are not necessary or useful, explain why. - [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) -- When feature gate is enabled, update the StatefulSet `volumeClaimTemplates` with `volumeClaimUpdatePolicy: InPlace` can successfully expand the PVCs. +- When feature gate is enabled, update the StatefulSet `volumeClaimTemplates` with `volumeClaimUpdateStrategy: InPlace` can successfully expand the PVCs. And running Pods are not restarted. ### Graduation Criteria @@ -717,7 +779,7 @@ enhancement: No changes required to maintain previous behavior. To make use of the enhancement, user can update `volumeClaimTemplates` of existing StatefulSets. -One can also update `volumeClaimUpdatePolicy` to `InPlace` in order to rollout the changes automatically. +One can also update `volumeClaimUpdateStrategy` to `InPlace` in order to rollout the changes automatically. ### Version Skew Strategy @@ -737,13 +799,13 @@ enhancement: No coordinating between the control plane and nodes are required, since this KEP does not involve nodes. Should enable this feature for APIServer before kube-controller-manager. -An n-1 kube-controller-manager should ignore the `volumeClaimUpdatePolicy` field and never touch PVCs. +An n-1 kube-controller-manager should ignore the `volumeClaimUpdateStrategy` field and never touch PVCs. It should always create PVCs with the latest `volumeClaimTemplates`. -If `volumeClaimUpdatePolicy` is set to `InPlace` while the kube-controller-manager is down, +If `volumeClaimUpdateStrategy` is set to `InPlace` while the kube-controller-manager is down, when new kube-controller-manager starts, it should pick this up and start rolling out PVCs immediately. -If `volumeClaimUpdatePolicy` is set to `InPlace` when the feature-gate of kube-controller-manager is disabled, +If `volumeClaimUpdateStrategy` is set to `InPlace` when the feature-gate of kube-controller-manager is disabled, kube-controller-manager should still update the controllerRevision and label on Pods. After that, when the feature-gate of kube-controller-manager is enabled, user needs to update the `volumeClaimTemplates` again to trigger another rollout. @@ -803,10 +865,10 @@ Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> The update to StatefulSet `volumeClaimTemplates` will be accepted by the API server while it is previously rejected. -StatefulSets gains a new field `volumeClaimUpdatePolicy` with default value `OnClaimDelete`. +StatefulSets gains a new field `volumeClaimUpdateStrategy` with default value `OnClaimDelete`. Otherwise No. -If `volumeClaimUpdatePolicy` is `OnClaimDelete` (the default values), +If `volumeClaimUpdateStrategy` is `OnClaimDelete` (the default values), the behavior of StatefulSet controller is almost the same as before. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? 
@@ -824,13 +886,13 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs now, disable this feature gate should not leave any inconsistent state. -The `volumeClaimUpdatePolicy` field will not be cleared automatically. +The `volumeClaimUpdateStrategy` field will not be cleared automatically. When it is set to `InPlace`, `volumeClaimTemplates` also remains in the controllerRevision. -User can rollback each StatefulSet manually by deleting the `volumeClaimUpdatePolicy` field. +User can rollback each StatefulSet manually by deleting the `volumeClaimUpdateStrategy` field. ###### What happens if we reenable the feature if it was previously rolled back? -If the `volumeClaimUpdatePolicy` is already set to `InPlace`, +If the `volumeClaimUpdateStrategy` is already set to `InPlace`, user needs to update the `volumeClaimTemplates` again to trigger a rollout. ###### Are there any tests for feature enablement/disablement? @@ -848,9 +910,9 @@ You can take a look at one potential example of such test in: https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 --> Will add unit tests for the StatefulSet controller with and without the feature gate, -`volumeClaimUpdatePolicy` set to `InPlace` and `OnClaimDelete` respectively. +`volumeClaimUpdateStrategy` set to `InPlace` and `OnClaimDelete` respectively. -Will add unit tests for exercising the switch of feature gate when `volumeClaimUpdatePolicy` already set. +Will add unit tests for exercising the switch of feature gate when `volumeClaimUpdateStrategy` already set. ### Rollout, Upgrade and Rollback Planning @@ -1137,7 +1199,7 @@ For each of them, fill in the following information by copying the below templat - Detection: apiserver_request_total{resource="persistentvolumeclaims",verb="patch",code!="200"} increased. Events on StatefulSet. - Mitigations: - Undo `volumeClaimTemplates` changes - - Set `volumeClaimUpdatePolicy` to `OnClaimDelete` + - Set `volumeClaimUpdateStrategy` to `OnClaimDelete` - Diagnostics: Events on StatefulSet - Testing: Will test the Event is emitted @@ -1145,7 +1207,7 @@ For each of them, fill in the following information by copying the below templat - Detection: Events on PVC. controller_{modify,expand}_volume_errors_total metrics on external-resizer - Mitigations: - Undo `volumeClaimTemplates` changes - - Set `volumeClaimUpdatePolicy` to `OnClaimDelete` + - Set `volumeClaimUpdateStrategy` to `OnClaimDelete` - Edit PVC manually to correct the issue - Diagnostics: Events on PVC, logs of external-resizer - Testing: No. the error is already reported on the PVC, by external-resizer. @@ -1284,10 +1346,10 @@ This downside is that the concurrency is lower, so the rolling update may take l ### When to track `volumeClaimTemplates` in `ControllerRevision` -The current design tracks volumeClaimTemplates in ControllerRevision only when `volumeClaimUpdatePolicy` is set to `InPlace`. +The current design tracks volumeClaimTemplates in ControllerRevision only when `volumeClaimUpdateStrategy` is set to `InPlace`. There are two reasons: -1. We want a new revision to trigger the rollout when `volumeClaimUpdatePolicy` is changed from `OnClaimDelete` to `InPlace`. +1. We want a new revision to trigger the rollout when `volumeClaimUpdateStrategy` is changed from `OnClaimDelete` to `InPlace`. 2. 
We want to avoid updating all the Pods under any StatefulSet at once when the feature-gate is enabled, to avoid overloading the control-plane. If we track volumeClaimTemplates whenever the feature-gate is enabled, we violate all these reasons. @@ -1299,13 +1361,13 @@ Or we can make this tri-state: While this resolves reason 2, it still violates reason 1. -We can add volumeClaimUpdatePolicy to ControllerRevision to resolve reason 1. +We can add volumeClaimUpdateStrategy to ControllerRevision to resolve reason 1. But all the policies we already have does not present in ControllerRevision. So this is not ideal either. The down-side of the current design is that `kubectl rollout undo` may not work as expected sometimes. -* If `volumeClaimUpdatePolicy` is set to `OnClaimDelete`, `kubectl rollout undo` will not undo the `volumeClaimTemplates`. -* When changing `volumeClaimUpdatePolicy` from `OnClaimDelete` to `InPlace` to trigger the rollout, `kubectl rollout undo` will be no-op. +* If `volumeClaimUpdateStrategy` is set to `OnClaimDelete`, `kubectl rollout undo` will not undo the `volumeClaimTemplates`. +* When changing `volumeClaimUpdateStrategy` from `OnClaimDelete` to `InPlace` to trigger the rollout, `kubectl rollout undo` will be no-op. * Consider the following history: 1. Pod Rev1 + PVC Rev1 + `OnClaimDelete` 2. Pod Rev2 + PVC Rev1 + `InPlace` From d038f906b0612d40c8bfe8c6acdaf807297e04a3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 20 Jun 2025 15:41:45 +0800 Subject: [PATCH 27/30] target 1.35 --- .../sig-apps/4650-stateful-set-update-claim-template/kep.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml index ef38ccf1298..c4b3acd7889 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -33,11 +33,11 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.34" +latest-milestone: "v1.35" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.34" + alpha: "v1.35" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From 211f8b609c6769ffef38979fbe9fccf777a5ea9b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 20 Jun 2025 16:22:42 +0800 Subject: [PATCH 28/30] add a note about VAC being disabled by default --- keps/sig-apps/4650-stateful-set-update-claim-template/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index e1e7e9b1654..8b1997b8a2e 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -256,6 +256,7 @@ nitty-gritty. Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet: * `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only) * `spec.volumeClaimTemplates.spec.volumeAttributesClassName` + * Note that this field is currently disabled by default. But should not affect the progress of this KEP. 
* `spec.volumeClaimTemplates.metadata.labels` * `spec.volumeClaimTemplates.metadata.annotations` From 746ff2675908e5bdab2d01f6bb8ea975cb464d5d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Mon, 23 Jun 2025 14:46:33 +0800 Subject: [PATCH 29/30] update PVC even if Pod is already deleted/updated Pod may be deleted externally, e.g. evicted. --- .../README.md | 65 ++++++++++++------- 1 file changed, 40 insertions(+), 25 deletions(-) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index 8b1997b8a2e..40b1fc1fa94 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -331,6 +331,7 @@ Some fields in the `status` are updated to reflect the status of the PVCs: } ``` We will decrease `currentReplicas` when we start to update the PVCs, and increase `updatedReplicas` when we create the new Pods. +We update `currentRevision` to `updateRevision` when all Pods and PVCs are ready. With these changes, user can still use `kubectl rollout status` to monitor the update process, both for automated patching and for the PVCs that need manual intervention. @@ -501,21 +502,26 @@ When only updating the volumeClaimTemplates: Assuming we are updating a replica from revision A to revision B: -| Pod | PVC | Action | -| --- | --- | --- | -| not existing | not existing | create PVC at revision B | -| not existing | at revision A | update PVC to revision B | -| not existing | at revision B | create Pod at revision B | -| at revision A | not existing | create PVC at revision B | -| at revision A | at revision A | update PVC to revision B | -| at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label | -| at revision B | not existing | create PVC at revision B | -| at revision B | existing | wait for Pod to be ready | - -Note that when Pod is at revision B but PVC is at revision A, we will not update PVC. -Such state can only happen when user set `volumeClaimUpdateStrategy` to `InPlace` when the feature-gate of KCM is disabled, -or disable the previously enabled feature-gate. -We require user to initiate another rollout to update the PVCs, to avoid any surprise. +| # | Pod | PVC | Action | +| --- | --- | --- | --- | +| 1 | not existing | not existing | create PVC at revision B | +| 2 | not existing | at revision A | create Pod at revision B | +| 3 | not existing | at revision B | create Pod at revision B | +| 4 | at revision A | not existing | create PVC at revision B | +| 5 | at revision A | at revision A | update PVC to revision B | +| 6 | at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label | +| 7 | at revision B | not existing | create PVC at revision B | +| 8 | at revision B | at revision A | update PVC to revision B | +| 9 | at revision B | at revision B | wait for Pod and PVCs to be ready, then advance to next replica | + +A normal rollout should be like: 5 -> 6 (-> 3) -> 9. + +Normally, when Pod is at revision B, PVCs will be at revision B and already ready, unless: +* when user set `volumeClaimUpdateStrategy` to `InPlace` when the feature-gate of KCM is disabled, + or disable the previously enabled feature-gate. +* When the Pod is deleted externally, e.g. be evicted or deleted manually. + +In such cases, we will still update PVCs at 8 and wait for the PVCs to be ready at 9. 
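+For illustration, the StatefulSet status a user might observe mid-rollout, while a replica sits at step 6 of the table above (all counts and hashes are hypothetical):
+
+```yaml
+status:
+  replicas: 3
+  currentReplicas: 1        # decremented as PVC updates start
+  updatedReplicas: 2        # incremented as new-revision Pods are created
+  currentRevision: web-5b9c8d7f6
+  updateRevision: web-7d58f9b6c   # currentRevision catches up once all Pods and PVCs are ready
+```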
When `volumeClaimUpdateStrategy` is updated from `OnClaimDelete` to `InPlace`, StatefulSet controller will begin to add claim templates to ControllerRevision, @@ -535,7 +541,7 @@ Failure cases: don't left too many PVCs being updated in-place. We expect to upd We should retry and report events for this. The events and status should look like those when the Pod creation fails. We update PVC before deleting the old Pod, so failure of PVC update should not disrupt running Pods, - and user should have time to fix this manually. + and user should have enough time to fix this manually. The failure cases of this kind includes (but not limited to): - immutable fields mismatch (e.g. storageClassName) - webhook @@ -548,15 +554,15 @@ Failure cases: don't left too many PVCs being updated in-place. We expect to upd We should block the StatefulSet rollout process if the PVC is never ready. - When individual PVC failed to become ready, the user can update that PVC manually to bring it back to ready. + - If the PVC cannot become ready because of the old Pod (e.g. unable to schedule), + user can delete the Pod and the StatefulSet controller will create a new Pod at new revision. - If the `volumeClaimTemplates` is updated again when the previous rollout is blocked, similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback), user may need to manually deal with the blocking PVCs (update or delete them). - - If the PVC cannot become ready because of the old Pod (e.g. unable to schedule), - user can delete the Pod and the StatefulSet controller will create a new Pod at new revision. In all cases, if the user determines the failure of updating PVCs is not critical, -he can change `volumeClaimUpdateStrategy` back to `OnClaimDelete` to unblock normal Pod rollout. +he can change `volumeClaimUpdateStrategy` back to `OnClaimDelete` to unblock normal Pod rollout immediately. ### Test Plan @@ -803,13 +809,10 @@ Should enable this feature for APIServer before kube-controller-manager. An n-1 kube-controller-manager should ignore the `volumeClaimUpdateStrategy` field and never touch PVCs. It should always create PVCs with the latest `volumeClaimTemplates`. -If `volumeClaimUpdateStrategy` is set to `InPlace` while the kube-controller-manager is down, -when new kube-controller-manager starts, it should pick this up and start rolling out PVCs immediately. - If `volumeClaimUpdateStrategy` is set to `InPlace` when the feature-gate of kube-controller-manager is disabled, kube-controller-manager should still update the controllerRevision and label on Pods. After that, when the feature-gate of kube-controller-manager is enabled, -user needs to update the `volumeClaimTemplates` again to trigger another rollout. +updates to PVCs will be picked up and rollout will start automatically. ## Production Readiness Review Questionnaire @@ -1336,10 +1339,22 @@ We've considered delete the Pod while/before updating the PVC, but realized seve We want to make sure the Pod is still running if we cannot update the PVC. * As described in [KEP-5381], we want to allow affinity change when the VolumeAttributesClass is updated. Updating PVC and Pod concurrently may trigger a race condition where the Pod can be scheduled to wrong node. +* Pod may depends on PVC updates, e.g. when the volume is full. So we should not wait for Pod to be ready before updating PVC. 
-The current order (wait for PVC ready before delete old Pod) has an extra advantage: -When Pod is ready, it is guaranteed that the PVC is ready too. +That left us with two options: +1. Wait for PVC ready before delete old Pod. +2. Wait for new Pod to be scheduled, with all volumes attached before update PVC. + +We choose 1 currently. This has an extra advantage: +When Pod is ready, PVCs will almost always be ready too. So any existing tools to monitor StatefulSet rollout process does not need to change. +But this is not guaranteed. If the Pod is deleted before the PVC is ready (be evicted, or manually), +we still want to ensure maximum Pod availability, so we will still create the Pod. +In this case, the Pod may be ready before PVCs are ready. + +We can choose to create Pod at current revision (instead of update revision) if PVCs are not ready. +But there may be some case where the PVCs depends on the new Pod (e.g. old Pod is not schedulable). +We don't want to block them. This downside is that the concurrency is lower, so the rolling update may take longer. From d4b37da6851a5cc6969356aa1891f479ad7d49b9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Mon, 23 Jun 2025 14:49:52 +0800 Subject: [PATCH 30/30] don't support OnDelete updateStrategy --- .../4650-stateful-set-update-claim-template/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md index 40b1fc1fa94..cbd6d185145 100644 --- a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -367,12 +367,13 @@ The patch used in server-side apply is the volumeClaimTemplates in the StatefulS * `controller-revision-hash` label is added to the PVCs. Naturally, most of the update control logic also applies to PVCs. -* If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. -* If `updateStrategy` is `OnDelete`, only update the PVCs if the Pod is deleted manually. +If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest. However, `minReadySeconds` is not considered when only PVCs are updated. because it is hard to determine when the PVC become ready. And updating PVCs is unlikely to disrupt workloads, so it should be unnecessary to inject delay into the update process. +If `updateStrategy` is `OnDelete`, we do not update the PVCs automatically. + When creating new PVCs, use the `volumeClaimTemplates` from the same revision that is used to create the Pod.