@@ -113,6 +113,7 @@ tags, and then generate with `hack/update-toc.sh`.
113
113
- [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
114
114
- [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
115
115
- [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature)
116
+ - [Order of Pod / PVC updates](#order-of-pod--pvc-updates)
116
117
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
117
118
<!-- /toc -->
118
119
@@ -186,6 +187,7 @@ Specifically, we will allow modifying the following fields of `spec.volumeClaimT
186
187
* modifying VolumeAttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
187
188
* modifying VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
188
189
* modifying VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
190
+
189
191
When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the
190
192
PersistentVolumeClaims in the StatefulSet's pods.
191
193
Updating a PersistentVolumeClaim behaves similarly to updating a Pod.
@@ -264,7 +266,6 @@ specify how to coordinate the update of PVCs and Pods. Possible values are:
264
266
265
267
Additionally, collect the status of managed PVCs and show it in the StatefulSet status.
266
268
Some fields in the `status` are updated to reflect the status of the PVCs:
267
- - claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
268
269
- currentRevision, updateRevision, currentReplicas, updatedReplicas
269
270
are updated to reflect the status of PVCs.
270
271
@@ -358,7 +359,7 @@ This might be a good place to talk about core concepts and how they relate.
358
359
When designing the `InPlace` update strategy, we want to reuse the infrastructure that controls Pod rollout.
359
360
We apply the changes to the PVCs before we set the new `controller-revision-hash` label.
360
361
New invariant established for PVCs:
361
- If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A.
362
+ If the Pod has the revision A label, each of its PVCs either does not exist yet or has been updated to revision A and is ready.
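
The invariant can be read as a simple predicate over a Pod's PVCs. The following Go sketch is illustrative only: `PVCState` and `InvariantHolds` are hypothetical names that do not come from the controller code, and readiness is reduced to a single boolean.

```go
package main

import "fmt"

// PVCState is a hypothetical, simplified view of one PVC for this sketch.
type PVCState struct {
	Exists   bool
	Revision string // value of the controller-revision-hash label on the PVC
	Ready    bool
}

// InvariantHolds reports whether every PVC of a Pod labelled podRevision
// either does not exist yet, or is at podRevision and ready.
func InvariantHolds(podRevision string, pvcs []PVCState) bool {
	for _, c := range pvcs {
		if !c.Exists {
			continue // a PVC that is not created yet does not violate the invariant
		}
		if c.Revision != podRevision || !c.Ready {
			return false
		}
	}
	return true
}

func main() {
	pvcs := []PVCState{
		{Exists: false},
		{Exists: true, Revision: "A", Ready: true},
	}
	fmt.Println(InvariantHolds("A", pvcs)) // Pod and PVCs agree on revision A
	fmt.Println(InvariantHolds("B", pvcs)) // Pod label is ahead of the PVC
}
```

A check of this shape would only be useful as a test helper or debugging aid; the controller maintains the invariant through the ordering of its updates rather than by asserting it.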
362
363
363
364
We introduce the `controller-revision-hash` label on PVCs to:
364
365
* Record how far the rollout has progressed, to ensure each PVC is updated only once per rollout.
@@ -420,28 +421,31 @@ but StatefulSet controller should not touch the PVCs and preserve the current be
420
421
The following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`.
421
422
422
423
When updating `volumeClaimTemplates` along with the Pod template, we will go through the following steps:
423
- 1. Delete the old pod.
424
- 2. Apply the changes to the PVCs used by this replica.
425
- 3. Create the new pod with new `controller-revision-hash` label.
426
- 4. Wait for the new pod and PVCs to be ready.
427
- 5. Advance to the next replica and repeat from step 1.
424
+ 1. Apply the changes to the PVCs used by this replica.
425
+ 2. Wait for the PVCs to be ready.
426
+ 3. Delete the old pod.
427
+ 4. Create the new pod with new `controller-revision-hash` label.
428
+ 5. Wait for the new pod to be ready.
429
+ 6. Advance to the next replica and repeat from step 1.
428
430
429
431
When only updating the volumeClaimTemplates:
430
432
1. Apply the changes to the PVCs used by this replica.
431
- 2. Update the pod with new `controller-revision-hash` label.
432
- 3. Wait for the PVCs to be ready.
433
+ 2. Wait for the PVCs to be ready.
434
+ 3. Update the pod with new `controller-revision-hash` label.
433
435
4. Advance to the next replica and repeat from step 1.
434
436
435
437
Assuming we are updating a replica from revision A to revision B:
436
438
437
439
| Pod | PVC | Action |
438
440
| --- | --- | --- |
439
- | - | not existing | create PVC at revision B |
441
+ | not existing | not existing | create PVC at revision B |
440
442
| not existing | at revision A | update PVC to revision B |
441
- | not existing | at revision B | create Pod at revision B |
443
+ | not existing | at revision B | create Pod at revision B |
444
+ | at revision A | not existing | create PVC at revision B |
442
445
| at revision A | at revision A | update PVC to revision B |
443
- | at revision A | at revision B | delete Pod or update Pod label |
444
- | at revision B | existing | wait for Pod/PVC to be ready |
446
+ | at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
447
+ | at revision B | not existing | create PVC at revision B |
448
+ | at revision B | existing | wait for Pod to be ready |
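
The table can also be read as a pure function from the observed (Pod, PVC) state to the next action. The Go sketch below is a hypothetical illustration, not the controller's implementation: `State` and `NextAction` are invented names, and the row with no Pod and a PVC at revision B is treated as creating the Pod, on the assumption that there is no Pod to update in that state.

```go
package main

import "fmt"

// State is a hypothetical, simplified state shared by Pods and PVCs in this sketch.
type State int

const (
	NotExisting State = iota
	AtRevisionA
	AtRevisionB
)

// NextAction returns the action for one replica that is moving from
// revision A to revision B, following the rows of the table.
func NextAction(pod, pvc State) string {
	switch {
	case pvc == NotExisting:
		// Applies to every Pod state: the PVC is always created at the new revision.
		return "create PVC at revision B"
	case pod == AtRevisionB:
		// Any existing PVC; a PVC still at revision A is intentionally left alone.
		return "wait for Pod to be ready"
	case pvc == AtRevisionA:
		return "update PVC to revision B"
	case pod == NotExisting:
		// Assumption of this sketch: with no Pod present, the Pod is created.
		return "create Pod at revision B"
	default:
		// Pod at revision A, PVC at revision B.
		return "wait for PVC to be ready, then delete Pod or update Pod label"
	}
}

func main() {
	for _, pod := range []State{NotExisting, AtRevisionA, AtRevisionB} {
		for _, pvc := range []State{NotExisting, AtRevisionA, AtRevisionB} {
			fmt.Printf("pod=%d pvc=%d -> %s\n", pod, pvc, NextAction(pod, pvc))
		}
	}
}
```

Writing the table as a total function makes it easy to check that every combination of states has exactly one action.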
445
449
446
450
Note that when the Pod is at revision B but the PVC is at revision A, we will not update the PVC.
447
451
Such a state can only happen when the user sets `volumeClaimUpdatePolicy` to `InPlace` while the KCM feature gate is disabled,
@@ -451,10 +455,16 @@ We require user to initiate another rollout to update the PVCs, to avoid any sur
451
455
Failure cases: don't leave too many PVCs being updated in place at the same time. We expect to update the PVCs in order.
452
456
453
457
- If the PVC update fails, we should block the StatefulSet rollout process.
454
- This will also block the creation of new Pod.
455
- We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod.
456
- If this still happens (e.g., because of webhook), We should retry and report events for this.
458
+ If the PVC update fails, we should retry and report events for the failure.
457
459
The events and status should look like those reported when Pod creation fails.
460
+ We update the PVC before deleting the old Pod, so a failed PVC update should not disrupt running Pods,
461
+ and the user should have time to fix it manually.
462
+ Failure cases of this kind include (but are not limited to):
463
+ - mismatch in immutable fields (e.g. `storageClassName`)
464
+ - rejection by a webhook
465
+ - [storage quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota)
466
+ - [VAC quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass)
467
+ - `StorageClass.allowVolumeExpansion` not set to `true`
458
468
459
469
- While waiting for the PVC to become ready,
460
470
we should update the status, just as we do when waiting for a Pod to be ready.
@@ -1240,6 +1250,21 @@ So we don't know what to set if `volumeClaimTemplates` is smaller than PVC statu
1240
1250
1241
1251
User can still update PVC manually.
1242
1252
1253
+ ### Order of Pod / PVC updates
1254
+
1255
+ We considered deleting the Pod while or before updating the PVC, but found several issues:
1256
+ * Admission of a PVC update is fairly complex and can fail for many reasons.
1257
+ We want to make sure the Pod is still running if we cannot update the PVC.
1258
+ * As described in [KEP-5381], we want to allow affinity changes when the VolumeAttributesClass is updated.
1259
+ Updating the PVC and the Pod concurrently may trigger a race condition in which the Pod is scheduled to the wrong node.
1260
+
1261
+ The current order (wait for the PVC to be ready before deleting the old Pod) has an extra advantage:
1262
+ When the Pod is ready, the PVC is guaranteed to be ready too.
1263
+ So existing tools that monitor the StatefulSet rollout process do not need to change.
1264
+
1265
+ The downside is that concurrency is lower, so the rolling update may take longer.
1266
+
1267
+ [KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional
1243
1268
1244
1269
## Infrastructure Needed (Optional)
1245
1270