diff --git a/keps/prod-readiness/sig-apps/4650.yaml b/keps/prod-readiness/sig-apps/4650.yaml new file mode 100644 index 00000000000..31adc0d5d14 --- /dev/null +++ b/keps/prod-readiness/sig-apps/4650.yaml @@ -0,0 +1,3 @@ +kep-number: 4650 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/README.md b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md new file mode 100644 index 00000000000..cbd6d185145 --- /dev/null +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/README.md @@ -0,0 +1,1405 @@ + +# KEP-4650: StatefulSet Support for Updating Volume Claim Template + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Kubernetes API Changes](#kubernetes-api-changes) + - [Kubernetes Controller Changes](#kubernetes-controller-changes) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1: Batch Expand Volumes](#story-1-batch-expand-volumes) + - [Story 2: Asymmetric Replicas](#story-2-asymmetric-replicas) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Extensively validate the updated volumeClaimTemplates](#extensively-validate-the-updated-volumeclaimtemplates) + - [Support for updating arbitrary fields in volumeClaimTemplates](#support-for-updating-arbitrary-fields-in-volumeclaimtemplates) + - [Patch PVC size regardless of the immutable fields](#patch-pvc-size-regardless-of-the-immutable-fields) + - [Support for automatically skip not managed PVCs](#support-for-automatically-skip-not-managed-pvcs) + - [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels) + - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas) + - [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature) + - [Order of Pod / PVC updates](#order-of-pod--pvc-updates) + - [When to track volumeClaimTemplates in ControllerRevision](#when-to-track-volumeclaimtemplates-in-controllerrevision) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [x] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Kubernetes does not support the modification of the `volumeClaimTemplates` of a StatefulSet currently. +This enhancement proposes relaxing validation of StatefulSet's VolumeClaim template. +Specifically, we will allow modifying the following fields of `spec.volumeClaimTemplates`: +* increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`) +* modifying Volume AttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`) +* modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`) +* modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`) + +When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the +PersistentVolumeClaims in the StatefulSet's pods. +The behavior of updating PersistentVolumeClaim is similar to updating Pod. +The updates to PersistentVolumeClaim will be coordinated with Pod updates to honor any dependencies between them. + +## Motivation + + + +Currently there are very few things that users can do to update the volumes of +their existing StatefulSet deployments. +They can only expand the volumes, or modify them with VolumeAttributesClass +by updating individual PersistentVolumeClaim objects as an ad-hoc operation. +When the StatefulSet scales up, the new PVC(s) will be created with the old +config and this again needs manual intervention. +This brings many headaches in a continuously evolving environment. 
+ +### Goals + + +* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`, specifically: + * increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`) + * modifying Volume AttributesClass used by the claim( `spec.volumeClaimTemplates.spec.volumeAttributesClassName`) + * modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`) + * modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`) +* Add `.spec.volumeClaimUpdateStrategy` allowing users to decide how the volume claim will be updated: in-place or on PVC deletion. + + +### Non-Goals + + +* Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically. +* Validate the updated `volumeClaimTemplates` as how PVC patch does. +* Update ephemeral volumes. +* Patch PVCs that are different from the template, e.g. StatefulSet adopts the pre-existing PVCs. +* Support for volumes that only support offline expansion. + + +## Proposal + + + +### Kubernetes API Changes + +Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet: + * `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only) + * `spec.volumeClaimTemplates.spec.volumeAttributesClassName` + * Note that this field is currently disabled by default. But should not affect the progress of this KEP. + * `spec.volumeClaimTemplates.metadata.labels` + * `spec.volumeClaimTemplates.metadata.annotations` + +Introduce a new field in StatefulSet `spec`: `volumeClaimUpdateStrategy` to +specify how to coordinate the update of PVCs and Pods. +It is defined as a struct to allow future extensions. +Possible types are: +- `OnClaimDelete`: the default value, only update the PVC when the the old PVC is deleted. +- `InPlace`: patch the PVC in-place if possible. Also includes the `OnClaimDelete` behavior. + +```golang +type StatefulSetSpec struct { + // volumeClaimUpdateStrategy indicates how PersistentVolumeClaims should be + // updated to match the volumeClaimTemplates. + // +optional + VolumeClaimUpdateStrategy StatefulSetVolumeClaimUpdateStrategy +} + +// StatefulSetVolumeClaimUpdateStrategy indicates the strategy that the StatefulSet +// controller will use to update PersistentVolumeClaims. It includes any additional parameters +// necessary to perform the update for the indicated strategy. +type StatefulSetVolumeClaimUpdateStrategy struct { + // Type indicates the type of the StatefulSetVolumeClaimUpdateStrategy. + Type StatefulSetVolumeClaimUpdateStrategyType +} + +// StatefulSetVolumeClaimUpdateStrategyType is a string enumeration type that enumerates +// all possible update strategies for the PersistentVolumeClaims managed by StatefulSet. +type StatefulSetVolumeClaimUpdateStrategyType string + +const ( + // InPlaceStatefulSetVolumeClaimUpdateStrategy indicates that the updates to + // volumeClaimTemplate will be propagated to the managed PersistentVolumeClaims + // before updating the Pods. Claims are recreated at the same revision as the corresponding Pod. + // The update is in-place without interruption or data loss. + InPlaceStatefulSetVolumeClaimUpdateStrategy StatefulSetVolumeClaimUpdateStrategyType = "InPlace" + // OnClaimDeleteStatefulSetVolumeClaimUpdateStrategy triggers the legacy behavior. + // Updates to volumeClaimTemplate only affects the new claims. Version + // tracking and ordered rolling updates are disabled. 
Claims are recreated
+	// from the StatefulSetSpec when they are manually deleted.
+	OnClaimDeleteStatefulSetVolumeClaimUpdateStrategy StatefulSetVolumeClaimUpdateStrategyType = "OnClaimDelete"
+)
+```
+
+Additionally, collect the status of the managed PVCs and surface it in the StatefulSet status.
+Some fields in the `status` are updated to reflect the state of the PVCs:
+- currentRevision, updateRevision, currentReplicas, updatedReplicas
+  are updated to reflect the status of PVCs.
+
+```diff
+ // StatefulSetStatus represents the current state of a StatefulSet.
+ type StatefulSetStatus struct {
+-	// currentReplicas is the number of Pods created by the StatefulSet controller from the StatefulSet version
++	// currentReplicas is the number of replicas with PersistentVolumeClaims updated to, and Pods created from, the StatefulSet version
+ 	// indicated by currentRevision.
+	CurrentReplicas int32
+
+-	// updatedReplicas is the number of Pods created by the StatefulSet controller from the StatefulSet version
++	// updatedReplicas is the number of replicas with PersistentVolumeClaims updated to, and Pods created from, the StatefulSet version
+ 	// indicated by updateRevision.
+	UpdatedReplicas int32
+
+-	// currentRevision, if not empty, indicates the version of the StatefulSet used to generate Pods in the
++	// currentRevision, if not empty, indicates the version of the StatefulSet used to generate PersistentVolumeClaims and Pods in the
+ 	// sequence [0,currentReplicas).
+	CurrentRevision string
+
+-	// updateRevision, if not empty, indicates the version of the StatefulSet used to generate Pods in the sequence
++	// updateRevision, if not empty, indicates the version of the StatefulSet used to generate PersistentVolumeClaims and Pods in the sequence
+ 	// [replicas-updatedReplicas,replicas)
+	UpdateRevision string
+ }
+```
+We decrease `currentReplicas` when we start updating the PVCs, and increase `updatedReplicas` when we create the new Pods.
+We set `currentRevision` to `updateRevision` once all Pods and PVCs are ready.
+
+With these changes, users can still use `kubectl rollout status` to monitor the update process,
+both for automated patching and for the PVCs that need manual intervention.
+
+A PVC is considered ready if:
+* the PVC's `status.capacity.storage` is greater than or equal to min(template request, PVC spec request).
+  For example, if the template requests 10Gi and the PVC has 10Gi but a requested expansion to 100Gi has failed, we still consider it ready.
+* the PVC's `status.currentVolumeAttributesClassName` equals its `spec.volumeAttributesClassName`.
+
+A new label `controller-revision-hash` is added to the PVCs
+to ensure we have the correct version of a PVC in the cache when determining whether it is ready.
+
+### Kubernetes Controller Changes
+
+Additionally, watch for PVC events in order to kick off the update process when a PVC becomes ready.
+
+If the `volumeClaimUpdateStrategy` field is set to `OnClaimDelete`, nothing changes.
+To opt in to the new behavior, the `InPlace` strategy should be used.
+This new behaviour is described below.
+
+Include `volumeClaimTemplates` in the `ControllerRevision`.
+
+Since modifying `volumeClaimTemplates` will change the hash,
+add support for updating the `controller-revision-hash` label of a Pod without deleting and recreating it,
+if the Pod template is not changed.
+
+Before deleting an old Pod, or, if the Pod template is not changed, before updating its label,
+use server-side apply to update the PVCs used by that Pod.
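+
+For illustration only, a minimal sketch of such an apply call is shown below, assuming client-go apply
+configurations and a client-go version that exposes the `volumeAttributesClassName` field; the helper name
+`applyClaim` and the field manager string are hypothetical, and the patch content follows the rules
+described in the next paragraph.
+
+```go
+import (
+	"context"
+
+	corev1 "k8s.io/api/core/v1"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
+	"k8s.io/client-go/kubernetes"
+)
+
+// applyClaim is an illustrative sketch of the per-claim server-side apply.
+func applyClaim(ctx context.Context, c kubernetes.Interface,
+	template, current *corev1.PersistentVolumeClaim, revision string) error {
+	// Never shrink: request max(template request, current PVC request).
+	size := template.Spec.Resources.Requests[corev1.ResourceStorage]
+	if cur, ok := current.Spec.Resources.Requests[corev1.ResourceStorage]; ok && cur.Cmp(size) > 0 {
+		size = cur
+	}
+
+	spec := corev1ac.PersistentVolumeClaimSpec().
+		WithResources(corev1ac.VolumeResourceRequirements().
+			WithRequests(corev1.ResourceList{corev1.ResourceStorage: size}))
+	if template.Spec.VolumeAttributesClassName != nil {
+		spec = spec.WithVolumeAttributesClassName(*template.Spec.VolumeAttributesClassName)
+	}
+
+	claim := corev1ac.PersistentVolumeClaim(current.Name, current.Namespace).
+		WithLabels(template.Labels).
+		// Record the revision so the claim is only updated once per rollout.
+		WithLabels(map[string]string{"controller-revision-hash": revision}).
+		WithAnnotations(template.Annotations).
+		WithSpec(spec)
+
+	_, err := c.CoreV1().PersistentVolumeClaims(current.Namespace).
+		Apply(ctx, claim, metav1.ApplyOptions{FieldManager: "statefulset", Force: true})
+	return err
+}
+```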
+
+The patch used in the server-side apply is the corresponding entry of `volumeClaimTemplates` in the StatefulSet, except:
+* `spec.resources.requests.storage` is set to max(template `spec.resources.requests.storage`, PVC `spec.resources.requests.storage`),
+  so that we never decrease the storage size.
+* the `controller-revision-hash` label is added to the PVCs.
+
+Naturally, most of the update control logic also applies to PVCs.
+If `updateStrategy` is `RollingUpdate`, update the PVCs in order from the largest ordinal to the smallest.
+However, `minReadySeconds` is not considered when only PVCs are updated,
+because it is hard to determine when a PVC becomes ready,
+and updating PVCs is unlikely to disrupt workloads, so it should be unnecessary to inject a delay into the update process.
+
+If `updateStrategy` is `OnDelete`, we do not update the PVCs automatically.
+
+When creating new PVCs, use the `volumeClaimTemplates` from the same revision that is used to create the Pod.
+
+### User Stories (Optional)
+
+#### Story 1: Batch Expand Volumes
+
+We run a CI/CD system and want end-to-end automation.
+To expand the volumes managed by a StatefulSet,
+we can use the same pipeline that we already use to update the Pods.
+All of the test, review, approval, and rollback processes can be reused.
+
+#### Story 2: Asymmetric Replicas
+
+The storage requirements of different replicas are not identical,
+so we still want to update each PVC manually and separately.
+We may also update the `volumeClaimTemplates` for new replicas,
+but we don't want the controller to interfere with the existing replicas.
+
+### Notes/Constraints/Caveats (Optional)
+
+When designing the `InPlace` update strategy, we want to reuse the infrastructure that controls Pod rollout.
+We apply the changes to the PVCs before we set the new `controller-revision-hash` label.
+This establishes a new invariant for PVCs:
+if a Pod has the revision A label, all of its PVCs either do not exist yet, or are updated to revision A and ready.
+
+We introduce the `controller-revision-hash` label on PVCs to:
+* record how far we have progressed, ensuring each PVC is only updated once per rollout;
+* check, while waiting for PVCs to become ready, that we have the correct version in the informer cache.
+
+The rationale for using server-side apply to update PVCs is to avoid interfering with other controllers
+or human operators that operate on PVCs:
+* if additional annotations/labels are added to the PVCs by others, do not remove them;
+* if the storage class is not set in the template, we should not care about the storage class of the PVCs.
+
+### Risks and Mitigations
+
+Since we don't allow decreasing the storage size in `volumeClaimTemplates`,
+it is not possible to run `kubectl rollout undo` after increasing it.
+This may surprise users already working with StatefulSets and is arguably a breaking change.
+We may relax this restriction in the future.
+Unfortunately, since volume expansion cannot be fully cancelled,
+undoing the StatefulSet changes may not be enough to revert the system to its previous state,
+but it should be enough to unblock the StatefulSet rollout.
+
+A user who can update the StatefulSet gains implicit permission to update the PVCs.
+This can incur extra fees from cloud providers.
+Cluster administrators should set up appropriate quotas or validation to mitigate this.
+
+Interfering with other controllers or human operators:
+over the years, users may have deployed third-party controllers to, e.g., expand the volume automatically.
+We should not interfere with them. Like Pods, we use the `controller-revision-hash` label to record whether we have updated the PVCs.
+If the `controller-revision-hash` label on either the Pod or the PVC already matches, we will not touch the PVCs again.
+So we will not interfere with other actors as long as they preserve the `controller-revision-hash` label.
+
+A new Pod may still see the old PVC configuration.
+We ensure that the PVC is updated before the new Pod is created.
+However, operations on PVCs can be asynchronous, and expansion may not finish without a running Pod.
+
+## Design Details
+
+When `volumeClaimUpdateStrategy` is `OnClaimDelete`, the API server should accept changes to `volumeClaimTemplates`,
+but the StatefulSet controller should not touch the PVCs, preserving the current behaviour.
+The following describes the workflow when `volumeClaimUpdateStrategy` is `InPlace`.
+
+When updating `volumeClaimTemplates` along with the Pod template, we go through the following steps:
+1. Apply the changes to the PVCs used by this replica.
+2. Wait for the PVCs to be ready.
+3. Delete the old Pod.
+4. Create the new Pod with the new `controller-revision-hash` label.
+5. Wait for the new Pod to be ready.
+6. Advance to the next replica and repeat from step 1.
+
+When only updating the `volumeClaimTemplates`:
+1. Apply the changes to the PVCs used by this replica.
+2. Wait for the PVCs to be ready.
+3. Update the Pod with the new `controller-revision-hash` label.
+4. Advance to the next replica and repeat from step 1.
+
+Assuming we are updating a replica from revision A to revision B:
+
+| # | Pod | PVC | Action |
+| --- | --- | --- | --- |
+| 1 | not existing | not existing | create PVC at revision B |
+| 2 | not existing | at revision A | create Pod at revision B |
+| 3 | not existing | at revision B | create Pod at revision B |
+| 4 | at revision A | not existing | create PVC at revision B |
+| 5 | at revision A | at revision A | update PVC to revision B |
+| 6 | at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
+| 7 | at revision B | not existing | create PVC at revision B |
+| 8 | at revision B | at revision A | update PVC to revision B |
+| 9 | at revision B | at revision B | wait for Pod and PVCs to be ready, then advance to next replica |
+
+A normal rollout looks like: 5 -> 6 (-> 3) -> 9.
+
+Normally, when the Pod is at revision B, its PVCs will already be at revision B and ready, unless:
+* the user set `volumeClaimUpdateStrategy` to `InPlace` while the kube-controller-manager feature gate was disabled,
+  or disabled a previously enabled feature gate;
+* the Pod was deleted externally, e.g. evicted or deleted manually.
+
+In such cases, we still update the PVCs at row 8 and wait for them to become ready at row 9.
+
+When `volumeClaimUpdateStrategy` is updated from `OnClaimDelete` to `InPlace`,
+the StatefulSet controller begins to add claim templates to the ControllerRevision,
+which changes its hash and triggers a rollout.
+The rollout works like the `volumeClaimTemplates`-only rollout above.
+In this case, applying the template to a PVC is effectively a no-op if the PVC is not actually changed
+(apart from adding the new controller-revision-hash label), so the rollout should proceed quickly.
+
+When `volumeClaimUpdateStrategy` is updated from `InPlace` to `OnClaimDelete`,
+the StatefulSet controller begins to remove claim templates from the ControllerRevision,
+which changes its hash and triggers a rollout.
+PVCs will not be touched, and Pods will be updated with the new `controller-revision-hash` label.
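+
+The workflow above repeatedly waits for PVCs to become ready, using the readiness definition from the
+Proposal section. A minimal sketch of that check is shown below; the helper name `claimIsReady` is
+hypothetical and the code is illustrative rather than the actual controller implementation.
+
+```go
+import (
+	corev1 "k8s.io/api/core/v1"
+	"k8s.io/utils/ptr"
+)
+
+// claimIsReady sketches the readiness rule: capacity has reached
+// min(template request, claim spec request), and the currently applied
+// VolumeAttributesClass matches the one requested on the claim.
+func claimIsReady(template, claim *corev1.PersistentVolumeClaim) bool {
+	want := template.Spec.Resources.Requests[corev1.ResourceStorage]
+	if spec, ok := claim.Spec.Resources.Requests[corev1.ResourceStorage]; ok && spec.Cmp(want) < 0 {
+		// A claim expanding beyond the template does not block readiness.
+		want = spec
+	}
+	capacity, ok := claim.Status.Capacity[corev1.ResourceStorage]
+	if !ok || capacity.Cmp(want) < 0 {
+		return false
+	}
+	return ptr.Equal(claim.Spec.VolumeAttributesClassName, claim.Status.CurrentVolumeAttributesClassName)
+}
+```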
+
+Failure cases: we must not leave too many PVCs being updated in place at once; we expect to update the PVCs in order.
+
+- If the PVC update fails, we should block the StatefulSet rollout process.
+  We should retry and report events for this.
+  The events and status should look like those emitted when Pod creation fails.
+  We update the PVC before deleting the old Pod, so a failed PVC update should not disrupt running Pods,
+  and the user should have enough time to fix it manually.
+  The failure cases of this kind include (but are not limited to):
+  - immutable fields mismatch (e.g. storageClassName)
+  - rejection by an admission webhook
+  - [storage quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota)
+  - [VAC quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass)
+  - StorageClass.allowVolumeExpansion not set to true
+
+- While waiting for the PVC to become ready,
+  we should update the status, just like we do when waiting for a Pod to become ready.
+  We should block the StatefulSet rollout process if the PVC never becomes ready.
+
+- When an individual PVC fails to become ready, the user can update that PVC manually to bring it back to a ready state.
+  - If the PVC cannot become ready because of the old Pod (e.g. it is unable to schedule),
+    the user can delete the Pod and the StatefulSet controller will create a new Pod at the new revision.
+
+- If `volumeClaimTemplates` is updated again while the previous rollout is blocked,
+  then, similar to [Pods](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback),
+  the user may need to manually deal with the blocking PVCs (update or delete them).
+
+In all cases, if the user determines that the failure to update PVCs is not critical,
+they can change `volumeClaimUpdateStrategy` back to `OnClaimDelete` to unblock the normal Pod rollout immediately.
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+For alpha, the core packages we will be touching:
+- `pkg/controller/statefulset`: `2025-05-25` - `86.5%`
+- `pkg/controller/history`: `2025-05-25` - `84.5%`
+- `pkg/apis/apps/validation`: `2025-05-25` - `92.5%`
+
+##### Integration tests
+
+- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
+
+- When the feature gate is enabled, existing StatefulSets gain a default `volumeClaimUpdateStrategy` of `OnClaimDelete`, which can be updated to `InPlace`.
+  When the feature gate is then disabled, the `volumeClaimUpdateStrategy` field should remain unchanged, but the user can clear it manually.
+
+- When the feature gate is disabled in the middle of a PVC rollout, we should not update or wait for the PVCs anymore.
+  `volumeClaimTemplates` should remain in the ControllerRevision, and the current rollout should finish successfully.
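+
+These enablement/disablement cases exercise the guard the controller applies before touching PVCs. A
+minimal sketch of such a guard is shown below; the helper name `shouldReconcileClaims` is hypothetical,
+and `features.StatefulSetUpdateVolumeClaimTemplate` stands for the feature-gate constant this KEP would add.
+
+```go
+import (
+	utilfeature "k8s.io/apiserver/pkg/util/feature"
+	apps "k8s.io/kubernetes/pkg/apis/apps"
+	"k8s.io/kubernetes/pkg/features"
+)
+
+// shouldReconcileClaims sketches the check gating in-place PVC reconciliation:
+// both the feature gate and the InPlace strategy must be in effect.
+func shouldReconcileClaims(set *apps.StatefulSet) bool {
+	if !utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetUpdateVolumeClaimTemplate) {
+		// Gate disabled mid-rollout: stop updating or waiting for PVCs.
+		return false
+	}
+	return set.Spec.VolumeClaimUpdateStrategy.Type == apps.InPlaceStatefulSetVolumeClaimUpdateStrategy
+}
+```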
+ +##### e2e tests + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +- When feature gate is enabled, update the StatefulSet `volumeClaimTemplates` with `volumeClaimUpdateStrategy: InPlace` can successfully expand the PVCs. + And running Pods are not restarted. + +### Graduation Criteria + + + +#### Alpha + +- Feature implemented behind a feature flag +- Initial unit, integration and e2e tests completed + +#### Beta + +- Gather feedback from developers and surveys +- Complete features: StatefulSet status reporting and `kubectl rollout status` support. +- Additional tests are in Testgrid and linked in KEP +- Downgrade tests and scalability tests +- All functionality completed +- All security enforcement completed +- All monitoring requirements completed +- All testing requirements completed +- All known pre-release issues and gaps resolved + +**Note:** Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified + +#### GA + +- 3 examples of real-world usage +- Allowing time for feedback +- All issues and gaps identified as feedback during beta are resolved + + +### Upgrade / Downgrade Strategy + + + +No changes required to maintain previous behavior. + +To make use of the enhancement, user can update `volumeClaimTemplates` of existing StatefulSets. +One can also update `volumeClaimUpdateStrategy` to `InPlace` in order to rollout the changes automatically. + +### Version Skew Strategy + + + +No coordinating between the control plane and nodes are required, since this KEP does not involve nodes. + +Should enable this feature for APIServer before kube-controller-manager. +An n-1 kube-controller-manager should ignore the `volumeClaimUpdateStrategy` field and never touch PVCs. +It should always create PVCs with the latest `volumeClaimTemplates`. + +If `volumeClaimUpdateStrategy` is set to `InPlace` when the feature-gate of kube-controller-manager is disabled, +kube-controller-manager should still update the controllerRevision and label on Pods. +After that, when the feature-gate of kube-controller-manager is enabled, +updates to PVCs will be picked up and rollout will start automatically. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: StatefulSetUpdateVolumeClaimTemplate + - Components depending on the feature gate: + - kube-apiserver + - kube-controller-manager + +###### Does enabling the feature change any default behavior? + + +The update to StatefulSet `volumeClaimTemplates` will be accepted by the API server while it is previously rejected. +StatefulSets gains a new field `volumeClaimUpdateStrategy` with default value `OnClaimDelete`. + +Otherwise No. +If `volumeClaimUpdateStrategy` is `OnClaimDelete` (the default values), +the behavior of StatefulSet controller is almost the same as before. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + +Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs now, +disable this feature gate should not leave any inconsistent state. 
+ +The `volumeClaimUpdateStrategy` field will not be cleared automatically. +When it is set to `InPlace`, `volumeClaimTemplates` also remains in the controllerRevision. +User can rollback each StatefulSet manually by deleting the `volumeClaimUpdateStrategy` field. + +###### What happens if we reenable the feature if it was previously rolled back? + +If the `volumeClaimUpdateStrategy` is already set to `InPlace`, +user needs to update the `volumeClaimTemplates` again to trigger a rollout. + +###### Are there any tests for feature enablement/disablement? + + +Will add unit tests for the StatefulSet controller with and without the feature gate, +`volumeClaimUpdateStrategy` set to `InPlace` and `OnClaimDelete` respectively. + +Will add unit tests for exercising the switch of feature gate when `volumeClaimUpdateStrategy` already set. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + +CSI drivers with in-place ExpandVolume or ModifyVolume capabilities, +when `spec.resources.requests.storage` or `spec.volumeAttributesClassName` of `volumeClaimTemplates` is updated respectively. + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + +- PATCH StatefulSet + - kubectl or other user agents +- PATCH PersistentVolumeClaim (server-side apply) + - 1 per PVC in the StatefulSet (number of updated claim template * replica) + - StatefulSet controller (in KCM) + - triggered by the StatefulSet spec update + +StatefulSet controller will watch PVC updates. +(although statefulset controller does not watch PVCs before, KCM does) + + +###### Will enabling / using this feature result in introducing new API types? + + +No + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + +Not directly. The cloud provider may be called when the PVCs are updated, by CSI. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + +StatefulSet: +- `spec`: 1 new enum fields, ~10B +PersistentVolumeClaim: +- new label `controller-revision-hash` of size 32B + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + +No. 
+ +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + +The logic of StatefulSet controller is more complex, more CPU will be used. +TODO: measure the actual increase. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + +No. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +Not very different from the current StatefulSet controller workflow. + +If the API server and/or etcd is unavailable, we either cannot apply the update to PVCs, or cannot gather status of PVCs. +In both cases, the rollout will be blocked until the API server and/or etcd is available again. + +###### What are other known failure modes? + + + +- Rollout of the StatefulSet blocked due to failing to update PVCs + - Detection: apiserver_request_total{resource="persistentvolumeclaims",verb="patch",code!="200"} increased. Events on StatefulSet. + - Mitigations: + - Undo `volumeClaimTemplates` changes + - Set `volumeClaimUpdateStrategy` to `OnClaimDelete` + - Diagnostics: Events on StatefulSet + - Testing: Will test the Event is emitted + +- Rollout of the StatefulSet blocked due to PVCs never becomes ready, expansion or modify volume failed + - Detection: Events on PVC. controller_{modify,expand}_volume_errors_total metrics on external-resizer + - Mitigations: + - Undo `volumeClaimTemplates` changes + - Set `volumeClaimUpdateStrategy` to `OnClaimDelete` + - Edit PVC manually to correct the issue + - Diagnostics: Events on PVC, logs of external-resizer + - Testing: No. the error is already reported on the PVC, by external-resizer. + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +When SLOs are not being met, events of PVC or StatefulSet are emitted. +If problem is not determined from events, operator should check whether the PVC spec is updated correctly. +If so, follow the troubleshooting instructions of expanding or modifying volume. +If not, look into the KCM log to determine why the PVC is not updated, rasing the log level if necessary. + +## Implementation History + + +- 2024-05-17: initial version +- 2025-06-09: targeting v1.34 for alpha + +## Drawbacks + + + +## Alternatives + + +### Extensively validate the updated `volumeClaimTemplates` + +[KEP-0661] proposes that we should do extensive validation on the updated `volumeClaimTemplates`. +e.g., prevent decreasing the storage size, preventing expand if the storage class does not support it. +However, this have saveral drawbacks: +* If we disallow decreasing, we make the editing a one-way road. + If a user edited it then found it was a mistake, there is no way back. + The StatefulSet will be broken forever. If this happens, the updates to pods will also be blocked. This is not acceptable. +* To mitigate the above issue, we will want to prevent the user from going down this one-way road by mistake. + We are forced to do way more validations on APIServer, which is very complex, and fragile (please see KEP-0661). + For example: check storage class allowVolumeExpansion, check each PVC's storage class and size, + basically duplicate all the validations we have done to PVC. + And even if we do all the validations, there are still race conditions and async failures that we are impossible to catch. + I see this as a major drawback of KEP-0661 that I want to avoid in this KEP. 
+* Validation means we should disable rollback of storage size. If we enable it later, it can surprise users, if it is not called a breaking change. +* The validation is conflict to RecoverVolumeExpansionFailure feature. +* `volumeClaimTemplates` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, + a user may still want to affect new PVCs. +* It violates the high-level design. + The template describes a desired final state, rather than an immediate instruction. + A lot of things can happen externally after we update the template. + For example, I have an IaaC platform, which tries to `kubectl apply` one updated StatefulSet + one new StorageClass to the cluster to trigger the expansion of PVs. + We don't want to reject it just because the StorageClass is applied after the StatefulSet. + +### Support for updating arbitrary fields in `volumeClaimTemplates` + +No technical limitations. Just that we want to be careful and keep the changes small, so that we can move faster. +This is just an extra validation in APIServer. We may remove it later if we find it is not needed. + +### Patch PVC size regardless of the immutable fields + +We propose to patch the PVC as a whole, so it can only succeed if the immutable fields matches. + +If only expansion is supported, patching regardless of the immutable fields can be a logical choice. +But this KEP also integrates with volumeAttributesClass (VAC). VAC is closely coupled with storage class. +Only patching VAC if storage class matches is a very logical choice. +And we'd better follow the same operation model for all mutable fields. + + +### Support for automatically skip not managed PVCs + +Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`. +If it is set to `Async`, then we skip patching the PVCs that are not managed by the StatefulSet (e.g. StorageClass does not match). + +The rules to determine what PVCs are managed are a little bit tricky. +We have to check each field, and determine what to do for each field. +This makes us deeply coupled with the PVC implementation. + +And still, we want to keep the changes small. + +### Reconcile all PVCs regardless of Pod revision labels + +Like Pods, we only update the PVCs if the Pod revision labels is not the update revision. + +We need to unmarshal all revisions used by Pods to determine the desired PVC spec. +Even if we do so, we don't want to send a apply request for each PVC at each reconcile iteration. +We also don't want to replicate the SSA merging/extraction and validation logic, which can be complex and CPU-intensive. + + +### Treat all incompatible PVCs as unavailable replicas + +Currently, incompatible PVCs only blocks the rolling update, not scaling up or down. +Only the update revision is used for checking. + +We need to unmarshal all revisions used by Pods to determine the compatibility. +Even if we do so, old StatefulSets do not have claim info in its history. +If we just use the latest version, then all replicas may suddenly become unavailable, +and all operations are blocked. + +[KEP-0661]: https://github.com/kubernetes/enhancements/pull/3412 + +### Integrate with RecoverVolumeExpansionFailure feature + +We may decrease the size in PVC spec automatically to help recover from a failed expansion +if `RecoverVolumeExpansionFailure` feature gate is enabled. +However, when reducing the spec size of PVC, it must still be greater than its status (not equal to). 
+So we don't know what to set if `volumeClaimTemplates` is smaller than PVC status. + +User can still update PVC manually. + +### Order of Pod / PVC updates + +We've considered delete the Pod while/before updating the PVC, but realized several issues: +* The admission of PVC update is fairly complex, it can fail for many reasons. + We want to make sure the Pod is still running if we cannot update the PVC. +* As described in [KEP-5381], we want to allow affinity change when the VolumeAttributesClass is updated. + Updating PVC and Pod concurrently may trigger a race condition where the Pod can be scheduled to wrong node. +* Pod may depends on PVC updates, e.g. when the volume is full. So we should not wait for Pod to be ready before updating PVC. + +That left us with two options: +1. Wait for PVC ready before delete old Pod. +2. Wait for new Pod to be scheduled, with all volumes attached before update PVC. + +We choose 1 currently. This has an extra advantage: +When Pod is ready, PVCs will almost always be ready too. +So any existing tools to monitor StatefulSet rollout process does not need to change. +But this is not guaranteed. If the Pod is deleted before the PVC is ready (be evicted, or manually), +we still want to ensure maximum Pod availability, so we will still create the Pod. +In this case, the Pod may be ready before PVCs are ready. + +We can choose to create Pod at current revision (instead of update revision) if PVCs are not ready. +But there may be some case where the PVCs depends on the new Pod (e.g. old Pod is not schedulable). +We don't want to block them. + +This downside is that the concurrency is lower, so the rolling update may take longer. + +[KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional + +### When to track `volumeClaimTemplates` in `ControllerRevision` + +The current design tracks volumeClaimTemplates in ControllerRevision only when `volumeClaimUpdateStrategy` is set to `InPlace`. + +There are two reasons: +1. We want a new revision to trigger the rollout when `volumeClaimUpdateStrategy` is changed from `OnClaimDelete` to `InPlace`. +2. We want to avoid updating all the Pods under any StatefulSet at once when the feature-gate is enabled, to avoid overloading the control-plane. + +If we track volumeClaimTemplates whenever the feature-gate is enabled, we violate all these reasons. + +Or we can make this tri-state: +* empty/nil: the default and preserve the current behavior. +* `OnClaimDelete`: Add volumeClaimTemplate to the history, but don't update PVCs +* `InPlace`: Add volumeClaimTemplate to the history, and also update PVCs in-place + +While this resolves reason 2, it still violates reason 1. + +We can add volumeClaimUpdateStrategy to ControllerRevision to resolve reason 1. +But all the policies we already have does not present in ControllerRevision. So this is not ideal either. + +The down-side of the current design is that `kubectl rollout undo` may not work as expected sometimes. + +* If `volumeClaimUpdateStrategy` is set to `OnClaimDelete`, `kubectl rollout undo` will not undo the `volumeClaimTemplates`. +* When changing `volumeClaimUpdateStrategy` from `OnClaimDelete` to `InPlace` to trigger the rollout, `kubectl rollout undo` will be no-op. +* Consider the following history: + 1. Pod Rev1 + PVC Rev1 + `OnClaimDelete` + 2. Pod Rev2 + PVC Rev1 + `InPlace` + 3. 
Pod Rev2 + PVC Rev2 + `InPlace` + + Now if user revert to history 1 directly, `volumeClaimTemplates` will not be reverted. + But if the user revert to history 2, then history 1, `volumeClaimTemplates` will be reverted. + +While somewhat surprising, `kubectl rollout undo` is just a convenient method to update the StatefulSet. +User can always do the update manually. So this is not a big problem. + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml new file mode 100644 index 00000000000..c4b3acd7889 --- /dev/null +++ b/keps/sig-apps/4650-stateful-set-update-claim-template/kep.yaml @@ -0,0 +1,52 @@ +title: StatefulSet Support for Updating Volume Claim Template +kep-number: 4650 +authors: + - "@huww98" + - "@vie-serendipity" +owning-sig: sig-apps +participating-sigs: + - sig-storage +status: implementable +creation-date: 2024-05-17 +reviewers: + - "@kow3ns" + - "@gnufied" + - "@msau42" + - "@xing-yang" + - "@soltysh" +approvers: + - "@soltysh" + - "@xing-yang" + +see-also: + - "/keps/sig-storage/1790-recover-resize-failure" + - "/keps/sig-storage/3751-volume-attributes-class" +replaces: + - "https://github.com/kubernetes/enhancements/pull/2842" # Previous attempt on 0611 + - "https://github.com/kubernetes/enhancements/pull/3412" # Previous attempt on 0611 + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.35" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.35" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: StatefulSetUpdateVolumeClaimTemplate + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: []