Commit aa58e48

change PVC/Pod update order
1 parent af71200 commit aa58e48

File tree

1 file changed

+46
-16
lines changed
  • keps/sig-apps/4650-stateful-set-update-claim-template

keps/sig-apps/4650-stateful-set-update-claim-template/README.md

Lines changed: 46 additions & 16 deletions
@@ -113,6 +113,7 @@ tags, and then generate with `hack/update-toc.sh`.
113113
- [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
114114
- [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
115115
- [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature)
116+
- [Order of Pod / PVC updates](#order-of-pod--pvc-updates)
116117
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
117118
<!-- /toc -->
118119

@@ -186,6 +187,7 @@ Specifically, we will allow modifying the following fields of `spec.volumeClaimT
186187
* modifying Volume AttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
187188
* modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
188189
* modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
190+
189191
When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the
190192
PersistentVolumeClaims in the StatefulSet's pods.
191193
The behavior of updating PersistentVolumeClaim is similar to updating Pod.
@@ -236,6 +238,11 @@ and make progress.
236238
* Patch PVCs that are different from the template, e.g. StatefulSet adopts the pre-existing PVCs.
237239
* Support for volumes that only support offline expansion.
238240

241+
```
242+
<<[UNRESOLVED offline resize ]>>
243+
Whether and how we should support offline volume resize.
244+
<<[/UNRESOLVED]>>
245+
```
239246

240247
## Proposal
241248

@@ -264,7 +271,6 @@ specify how to coordinate the update of PVCs and Pods. Possible values are:
264271

265272
Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
266273
Some fields in the `status` are updated to reflect the status of the PVCs:
267-
- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
268274
- currentRevision, updateRevision, currentReplicas, updatedReplicas
269275
are updated to reflect the status of PVCs.
270276

@@ -358,7 +364,7 @@ This might be a good place to talk about core concepts and how they relate.
358364
When designing the `InPlace` update strategy, we want to reuse the infrastructures controlling Pod rollout.
359365
We apply the changes to the PVCs before we set new `controller-revision-hash` label.
360366
New invariant established for PVCs:
361-
If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A.
367+
If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A and ready.
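The invariant above can be read as a simple predicate over one replica. The following is a minimal sketch, not the controller's implementation: the `claim` type, its fields, and `invariantHolds` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// claim is a hypothetical, simplified view of one PVC owned by a replica.
type claim struct {
	exists   bool   // false if the PVC has not been created yet
	revision string // value of its controller-revision-hash label
	ready    bool   // true once the requested changes have taken effect
}

// invariantHolds reports whether a Pod labelled podRevision satisfies the
// invariant: every PVC is either missing, or at podRevision and ready.
func invariantHolds(podRevision string, claims []claim) bool {
	for _, c := range claims {
		if !c.exists {
			continue // a PVC that does not exist yet never violates the invariant
		}
		if c.revision != podRevision || !c.ready {
			return false
		}
	}
	return true
}

func main() {
	claims := []claim{
		{exists: true, revision: "A", ready: true},
		{exists: false},
	}
	fmt.Println(invariantHolds("A", claims))
	fmt.Println(invariantHolds("B", claims))
}
```

Under this reading, a Pod may only receive the revision B label once `invariantHolds("B", ...)` would return true for its claims.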
362368

363369
We introduce `controller-revision-hash` label on PVCs to:
364370
* Record where we have progressed, to ensure each PVC is only updated once per rollout.
@@ -420,28 +426,31 @@ but StatefulSet controller should not touch the PVCs and preserve the current be
420426
Following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`.
421427

422428
When updating volumeClaimTemplates along with pod template, we will go through the following steps:
423-
1. Delete the old pod.
424-
2. Apply the changes to the PVCs used by this replica.
425-
3. Create the new pod with new `controller-revision-hash` label.
426-
4. Wait for the new pod and PVCs to be ready.
427-
5. Advance to the next replica and repeat from step 1.
429+
1. Apply the changes to the PVCs used by this replica.
430+
2. Wait for the PVCs to be ready.
431+
3. Delete the old pod.
432+
4. Create the new pod with new `controller-revision-hash` label.
433+
5. Wait for the new pod to be ready.
434+
6. Advance to the next replica and repeat from step 1.
428435

429436
When only updating the volumeClaimTemplates:
430437
1. Apply the changes to the PVCs used by this replica.
431-
2. Update the pod with new `controller-revision-hash` label.
432-
3. Wait for the PVCs to be ready.
438+
2. Wait for the PVCs to be ready.
439+
3. Update the pod with new `controller-revision-hash` label.
433440
4. Advance to the next replica and repeat from step 1.
434441

435442
Assuming we are updating a replica from revision A to revision B:
436443

437444
| Pod | PVC | Action |
438445
| --- | --- | --- |
439-
| - | not existing | create PVC at revision B |
446+
| not existing | not existing | create PVC at revision B |
440447
| not existing | at revision A | update PVC to revision B |
441-
| not existing | at revision B | create Pod at revision B |
448+
| not existing | at revision B | create Pod at revision B |
449+
| at revision A | not existing | create PVC at revision B |
442450
| at revision A | at revision A | update PVC to revision B |
443-
| at revision A | at revision B | delete Pod or update Pod label |
444-
| at revision B | existing | wait for Pod/PVC to be ready |
451+
| at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
452+
| at revision B | not existing | create PVC at revision B |
453+
| at revision B | existing | wait for Pod to be ready |
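The table above can be sketched as a single decision function for one replica moving from revision A to revision B. This is an illustrative sketch under stated assumptions, not the StatefulSet controller's code: the `state` type, its constants, and `nextAction` are hypothetical names. It assumes that when the Pod does not exist and the PVC is already at revision B the Pod is created, since there is no Pod to update, and that a Pod already at revision B never triggers a PVC update.

```go
package main

import "fmt"

// state is a hypothetical encoding of where a Pod or PVC stands in the rollout.
type state int

const (
	notExisting state = iota
	atRevisionA
	atRevisionB
)

// nextAction returns the action for one replica, following the table row by row.
func nextAction(pod, pvc state) string {
	switch {
	case pvc == notExisting:
		// A missing PVC is always created first, whatever the Pod state.
		return "create PVC at revision B"
	case pod == atRevisionB:
		// The Pod is already labelled B; an existing PVC is left untouched.
		return "wait for Pod to be ready"
	case pvc == atRevisionA:
		// Pod is missing or still at A: bring the PVC up to date first.
		return "update PVC to revision B"
	case pod == notExisting:
		// PVC is at B and no Pod exists (assumption: the Pod is created).
		return "create Pod at revision B"
	default:
		// Pod at A, PVC at B.
		return "wait for PVC to be ready, then delete Pod or update Pod label"
	}
}

func main() {
	fmt.Println(nextAction(atRevisionA, atRevisionA))
	fmt.Println(nextAction(atRevisionA, atRevisionB))
}
```

Ordering the cases this way keeps the PVC change ahead of any Pod disruption, which matches the reordering this commit makes.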
445454

446455
Note that when Pod is at revision B but PVC is at revision A, we will not update PVC.
447456
Such a state can only happen when the user sets `volumeClaimUpdatePolicy` to `InPlace` while the KCM feature gate is disabled,
@@ -451,10 +460,16 @@ We require user to initiate another rollout to update the PVCs, to avoid any sur
451460
Failure cases: don't leave too many PVCs being updated in-place. We expect to update the PVCs in order.
452461

453462
- If the PVC update fails, we should block the StatefulSet rollout process.
454-
This will also block the creation of new Pod.
455-
We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod.
456-
If this still happens (e.g., because of webhook), We should retry and report events for this.
463+
We should retry and report events for this.
457464
The events and status should look like those when the Pod creation fails.
465+
We update the PVC before deleting the old Pod, so a failed PVC update should not disrupt running Pods,
466+
and the user should have time to fix this manually.
467+
Failure cases of this kind include (but are not limited to):
468+
- immutable fields mismatch (e.g. storageClassName)
469+
- webhook
470+
- [storage quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota)
471+
- [VAC quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass)
472+
- StorageClass.allowVolumeExpansion not set to true
458473

459474
- While waiting for the PVC to become ready,
460475
We should update status, just like what we do when waiting for Pod to be ready.
@@ -1240,6 +1255,21 @@ So we don't know what to set if `volumeClaimTemplates` is smaller than PVC statu
12401255

12411256
User can still update PVC manually.
12421257

1258+
### Order of Pod / PVC updates
1259+
1260+
We considered deleting the Pod while or before updating the PVC, but found several issues:
1261+
* The admission of a PVC update is fairly complex and can fail for many reasons.
1262+
We want to make sure the Pod is still running if we cannot update the PVC.
1263+
* As described in [KEP-5381], we want to allow affinity change when the VolumeAttributesClass is updated.
1264+
  Updating the PVC and Pod concurrently may trigger a race condition where the Pod can be scheduled to the wrong node.
1265+
1266+
The current order (wait for the PVC to be ready before deleting the old Pod) has an extra advantage:
1267+
When Pod is ready, it is guaranteed that the PVC is ready too.
1268+
So existing tools that monitor the StatefulSet rollout process do not need to change.
1269+
1270+
The downside is that the concurrency is lower, so the rolling update may take longer.
1271+
1272+
[KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional
12431273

12441274
## Infrastructure Needed (Optional)
12451275
