Commit 0c4d6b5

change PVC/Pod update order
1 parent af71200 commit 0c4d6b5

1 file changed, +41 -16 lines changed

keps/sig-apps/4650-stateful-set-update-claim-template/README.md

Lines changed: 41 additions & 16 deletions
@@ -113,6 +113,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
 - [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
 - [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature)
+- [Order of Pod / PVC updates](#order-of-pod--pvc-updates)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->
 
@@ -186,6 +187,7 @@ Specifically, we will allow modifying the following fields of `spec.volumeClaimT
 * modifying Volume AttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
 * modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
 * modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
+
 When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the
 PersistentVolumeClaims in the StatefulSet's pods.
 The behavior of updating PersistentVolumeClaim is similar to updating Pod.
@@ -264,7 +266,6 @@ specify how to coordinate the update of PVCs and Pods. Possible values are:
 
 Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
 Some fields in the `status` are updated to reflect the status of the PVCs:
-- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
 - currentRevision, updateRevision, currentReplicas, updatedReplicas
 are updated to reflect the status of PVCs.
 
@@ -358,7 +359,7 @@ This might be a good place to talk about core concepts and how they relate.
 When designing the `InPlace` update strategy, we want to reuse the infrastructures controlling Pod rollout.
 We apply the changes to the PVCs before we set new `controller-revision-hash` label.
 New invariance established about PVCs:
-If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A.
+If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A and ready.
 
 We introduce `controller-revision-hash` label on PVCs to:
 * Record where have progressed, to ensure each PVC is only updated once per rollout.
@@ -420,28 +421,31 @@ but StatefulSet controller should not touch the PVCs and preserve the current be
 Following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`.
 
 When updating volumeClaimTemplates along with pod template, we will go through the following steps:
-1. Delete the old pod.
-2. Apply the changes to the PVCs used by this replica.
-3. Create the new pod with new `controller-revision-hash` label.
-4. Wait for the new pod and PVCs to be ready.
-5. Advance to the next replica and repeat from step 1.
+1. Apply the changes to the PVCs used by this replica.
+2. Wait for the PVCs to be ready.
+3. Delete the old pod.
+4. Create the new pod with new `controller-revision-hash` label.
+5. Wait for the new pod to be ready.
+6. Advance to the next replica and repeat from step 1.
 
 When only updating the volumeClaimTemplates:
 1. Apply the changes to the PVCs used by this replica.
-2. Update the pod with new `controller-revision-hash` label.
-3. Wait for the PVCs to be ready.
+2. Wait for the PVCs to be ready.
+3. Update the pod with new `controller-revision-hash` label.
 4. Advance to the next replica and repeat from step 1.
 
 Assuming we are updating a replica from revision A to revision B:
 
 | Pod | PVC | Action |
 | --- | --- | --- |
-| - | not existing | create PVC at revision B |
+| not existing | not existing | create PVC at revision B |
 | not existing | at revision A | update PVC to revision B |
-| not existing | at revision B | create Pod at revision B |
+| not existing | at revision B | update Pod at revision B |
+| at revision A | not existing | create PVC at revision B |
 | at revision A | at revision A | update PVC to revision B |
-| at revision A | at revision B | delete Pod or update Pod label |
-| at revision B | existing | wait for Pod/PVC to be ready |
+| at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
+| at revision B | not existing | create PVC at revision B |
+| at revision B | existing | wait for Pod to be ready |
 
 Note that when Pod is at revision B but PVC is at revision A, we will not update PVC.
 Such state can only happen when user set `volumeClaimUpdatePolicy` to `InPlace` when the feature-gate of KCM is disabled,
@@ -451,10 +455,16 @@ We require user to initiate another rollout to update the PVCs, to avoid any sur
 Failure cases: don't left too many PVCs being updated in-place. We expect to update the PVCs in order.
 
 - If the PVC update fails, we should block the StatefulSet rollout process.
-This will also block the creation of new Pod.
-We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod.
-If this still happens (e.g., because of webhook), We should retry and report events for this.
+We should retry and report events for this.
 The events and status should look like those when the Pod creation fails.
+We update PVC before deleting the old Pod, so failure of PVC update should not disrupt running Pods,
+and user should have time to fix this manually.
+The failure cases of this kind includes (but not limited to):
+- immutable fields mismatch (e.g. storageClassName)
+- webhook
+- [storage quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota)
+- [VAC quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass)
+- StorageClass.allowVolumeExpansion not set to true
 
 - While waiting for the PVC to become ready,
 We should update status, just like what we do when waiting for Pod to be ready.
@@ -1240,6 +1250,21 @@ So we don't know what to set if `volumeClaimTemplates` is smaller than PVC statu
 
 User can still update PVC manually.
 
+### Order of Pod / PVC updates
+
+We've considered delete the Pod while/before updating the PVC, but realized several issues:
+* The admission of PVC update is fairly complex, it can fail for many reasons.
+We want to make sure the Pod is still running if we cannot update the PVC.
+* As described in [KEP-5381], we want to allow affinity change when the VolumeAttributesClass is updated.
+Updating PVC and Pod concurrently may trigger a race condition where the Pod can be scheduled to wrong node.
+
+The current order (wait for PVC ready before delete old Pod) has an extra advantage:
+When Pod is ready, it is guaranteed that the PVC is ready too.
+So any existing tools to monitor StatefulSet rollout process does not need to change.
+
+This downside is that the concurrency is lower, so the rolling update may take longer.
+
+[KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional
 
 ## Infrastructure Needed (Optional)
 
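As a reading aid for the state table changed in the `-420,28 +421,31` hunk, here is a minimal Go sketch of the new rows as a lookup function. The `State` type, its constants, and `nextAction` are hypothetical names chosen for illustration; they are not identifiers from the KEP or from the StatefulSet controller. The action strings are copied verbatim from the table, including "update Pod at revision B" for a Pod that does not exist yet (the removed row read "create Pod at revision B").

```go
package main

import "fmt"

// State is a hypothetical enum for where a Pod or PVC stands
// during a rollout from revision A to revision B.
type State int

const (
	NotExisting State = iota
	AtRevisionA
	AtRevisionB
)

// nextAction returns the table's action for a (Pod, PVC) pair.
// It is an illustration of the table only, not controller code.
func nextAction(pod, pvc State) string {
	switch {
	case pvc == NotExisting:
		// A missing PVC is always created first, whatever the Pod state.
		return "create PVC at revision B"
	case pod == AtRevisionB:
		// Covers PVC at A or B: per the note under the table,
		// a PVC at revision A is not updated once the Pod is at B.
		return "wait for Pod to be ready"
	case pvc == AtRevisionA:
		return "update PVC to revision B"
	case pod == NotExisting:
		// Wording taken verbatim from the added table row.
		return "update Pod at revision B"
	default:
		// Pod at revision A, PVC at revision B.
		return "wait for PVC to be ready, then delete Pod or update Pod label"
	}
}

func main() {
	fmt.Println(nextAction(AtRevisionA, AtRevisionA))
}
```

The case order mirrors the commit's intent: PVC work comes before any Pod action, so a failed PVC update leaves the old Pod running.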