@@ -113,6 +113,7 @@ tags, and then generate with `hack/update-toc.sh`.
113
113
- [Reconcile all PVCs regardless of Pod revision labels](#reconcile-all-pvcs-regardless-of-pod-revision-labels)
114
114
- [Treat all incompatible PVCs as unavailable replicas](#treat-all-incompatible-pvcs-as-unavailable-replicas)
115
115
- [Integrate with RecoverVolumeExpansionFailure feature](#integrate-with-recovervolumeexpansionfailure-feature)
116
+ - [Order of Pod / PVC updates](#order-of-pod--pvc-updates)
116
117
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
117
118
<!-- /toc -->
118
119
@@ -186,6 +187,7 @@ Specifically, we will allow modifying the following fields of `spec.volumeClaimT
186
187
* modifying the VolumeAttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
187
188
* modifying the VolumeClaim template's labels (`spec.volumeClaimTemplates.metadata.labels`)
188
189
* modifying the VolumeClaim template's annotations (`spec.volumeClaimTemplates.metadata.annotations`)
190
+
189
191
When `volumeClaimTemplates` is updated, the StatefulSet controller will reconcile the
190
192
PersistentVolumeClaims in the StatefulSet's pods.
191
193
The behavior of updating PersistentVolumeClaim is similar to updating Pod.
@@ -236,6 +238,11 @@ and make progress.
236
238
* Patch PVCs that are different from the template, e.g. StatefulSet adopts the pre-existing PVCs.
237
239
* Support for volumes that only support offline expansion.
238
240
241
+ ```
242
+ <<[UNRESOLVED offline resize ]>>
243
+ Whether and how we should support offline volume resize.
244
+ <<[/UNRESOLVED]>>
245
+ ```
239
246
240
247
## Proposal
241
248
@@ -264,7 +271,6 @@ specify how to coordinate the update of PVCs and Pods. Possible values are:
264
271
265
272
Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
266
273
Some fields in the `status` are updated to reflect the status of the PVCs:
267
- - claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
268
274
- currentRevision, updateRevision, currentReplicas, updatedReplicas
269
275
are updated to reflect the status of PVCs.
270
276
@@ -358,7 +364,7 @@ This might be a good place to talk about core concepts and how they relate.
358
364
When designing the `InPlace` update strategy, we want to reuse the infrastructure that controls Pod rollout.
359
365
We apply the changes to the PVCs before we set the new `controller-revision-hash` label.
360
366
New invariant established for PVCs:
361
- If the Pod has revision A label, all its PVCs are either not existing yet, or updated to revision A.
367
+ If the Pod has the revision A label, each of its PVCs either does not exist yet, or is updated to revision A and ready.
362
368
363
369
We introduce a `controller-revision-hash` label on PVCs to:
364
370
* Record how far the rollout has progressed, to ensure each PVC is updated only once per rollout.
@@ -420,28 +426,31 @@ but StatefulSet controller should not touch the PVCs and preserve the current be
420
426
The following describes the workflow when `volumeClaimUpdatePolicy` is `InPlace`.
421
427
422
428
When updating volumeClaimTemplates along with pod template, we will go through the following steps:
423
- 1. Delete the old pod.
424
- 2. Apply the changes to the PVCs used by this replica.
425
- 3. Create the new pod with new `controller-revision-hash` label.
426
- 4. Wait for the new pod and PVCs to be ready.
427
- 5. Advance to the next replica and repeat from step 1.
429
+ 1. Apply the changes to the PVCs used by this replica.
430
+ 2. Wait for the PVCs to be ready.
431
+ 3. Delete the old pod.
432
+ 4. Create the new pod with the new `controller-revision-hash` label.
433
+ 5. Wait for the new pod to be ready.
434
+ 6. Advance to the next replica and repeat from step 1.
428
435
429
436
When only updating the volumeClaimTemplates:
430
437
1. Apply the changes to the PVCs used by this replica.
431
- 2. Update the pod with new `controller-revision-hash` label.
432
- 3. Wait for the PVCs to be ready.
438
+ 2. Wait for the PVCs to be ready.
439
+ 3. Update the pod with the new `controller-revision-hash` label.
433
440
4. Advance to the next replica and repeat from step 1.
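
The two step lists on the added (`+`) side differ only after the PVCs are ready. A minimal Go sketch of that ordering follows. The function name `planReplicaUpdate` and the step strings are hypothetical and exist only for illustration; this is not the StatefulSet controller's code.

```go
package main

import "fmt"

// planReplicaUpdate returns the ordered steps for a single replica under the
// InPlace policy. It is a hypothetical helper for this sketch, not a
// controller API. podTemplateChanged selects between the two step lists.
func planReplicaUpdate(podTemplateChanged bool) []string {
	// Both workflows start by updating the PVCs and waiting for them.
	steps := []string{
		"apply changes to the PVCs used by this replica",
		"wait for the PVCs to be ready",
	}
	if podTemplateChanged {
		// The Pod must be replaced, but only after the PVCs are ready.
		steps = append(steps,
			"delete the old Pod",
			"create the new Pod with the new controller-revision-hash label",
			"wait for the new Pod to be ready",
		)
	} else {
		// Only volumeClaimTemplates changed: relabel the running Pod.
		steps = append(steps,
			"update the Pod with the new controller-revision-hash label",
		)
	}
	return steps
}

func main() {
	for i, s := range planReplicaUpdate(true) {
		fmt.Printf("%d. %s\n", i+1, s)
	}
}
```

In both branches the Pod is touched only after the PVC steps, so a PVC update that never becomes ready leaves the old Pod running.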
434
441
435
442
Assuming we are updating a replica from revision A to revision B:
436
443
437
444
| Pod | PVC | Action |
438
445
| --- | --- | --- |
439
- | - | not existing | create PVC at revision B |
446
+ | not existing | not existing | create PVC at revision B |
440
447
| not existing | at revision A | update PVC to revision B |
441
- | not existing | at revision B | create Pod at revision B |
448
+ | not existing | at revision B | create Pod at revision B |
449
+ | at revision A | not existing | create PVC at revision B |
442
450
| at revision A | at revision A | update PVC to revision B |
443
- | at revision A | at revision B | delete Pod or update Pod label |
444
- | at revision B | existing | wait for Pod/PVC to be ready |
451
+ | at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
452
+ | at revision B | not existing | create PVC at revision B |
453
+ | at revision B | existing | wait for Pod to be ready |
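
The added (`+`) and unchanged rows of the table can be read as a small decision function. The Go sketch below is illustrative only: the `Revision` type and `nextAction` are hypothetical names, not part of the controller. It assumes that a missing Pod with a PVC already at revision B leads to creating the Pod, since a Pod that does not exist cannot be updated.

```go
package main

import "fmt"

// Revision is a hypothetical enum for this sketch; it is not a Kubernetes API type.
type Revision int

const (
	NotExisting Revision = iota
	AtRevisionA // the old revision
	AtRevisionB // the target revision
)

// nextAction returns the action for one replica that is moving from
// revision A to revision B, given the state of its Pod and PVC.
func nextAction(pod, pvc Revision) string {
	switch {
	case pvc == NotExisting:
		// A missing PVC is created at the target revision, whatever the Pod state.
		return "create PVC at revision B"
	case pod == AtRevisionB:
		// The PVC exists (at either revision) and is left untouched.
		return "wait for Pod to be ready"
	case pvc == AtRevisionA:
		// Pod is missing or still at revision A: update the PVC first.
		return "update PVC to revision B"
	case pod == NotExisting:
		// Assumption of this sketch: the Pod is created because it does not exist.
		return "create Pod at revision B"
	default:
		// Pod at revision A, PVC at revision B.
		return "wait for PVC to be ready, then delete Pod or update Pod label"
	}
}

func main() {
	fmt.Println(nextAction(AtRevisionA, AtRevisionA))
	fmt.Println(nextAction(AtRevisionA, AtRevisionB))
	fmt.Println(nextAction(AtRevisionB, AtRevisionA))
}
```

The second case encodes the rule that a Pod already at revision B never triggers a PVC update, even when its PVC is still at revision A.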
445
454
446
455
Note that when the Pod is at revision B but the PVC is at revision A, we will not update the PVC.
447
456
Such a state can only happen when the user sets `volumeClaimUpdatePolicy` to `InPlace` while the feature gate is disabled in KCM,
@@ -451,10 +460,16 @@ We require user to initiate another rollout to update the PVCs, to avoid any sur
451
460
Failure cases: don't leave too many PVCs being updated in-place. We expect to update the PVCs in order.
452
461
453
462
- If the PVC update fails, we should block the StatefulSet rollout process.
454
- This will also block the creation of new Pod.
455
- We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod.
456
- If this still happens (e.g., because of webhook), We should retry and report events for this.
463
+ We should retry and report events for this.
457
464
The events and status should look like those when the Pod creation fails.
465
+ We update the PVC before deleting the old Pod, so a failed PVC update should not disrupt running Pods,
466
+ and the user should have time to fix it manually.
467
+ Failure cases of this kind include (but are not limited to):
468
+ - immutable field mismatch (e.g. `storageClassName`)
469
+ - webhook rejection
470
+ - [storage quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota)
471
+ - [VAC quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass)
472
+ - `StorageClass.allowVolumeExpansion` not set to `true`
458
473
459
474
- While waiting for the PVC to become ready,
460
475
we should update the status, just as we do when waiting for a Pod to be ready.
@@ -1240,6 +1255,21 @@ So we don't know what to set if `volumeClaimTemplates` is smaller than PVC statu
1240
1255
1241
1256
User can still update PVC manually.
1242
1257
1258
+ ### Order of Pod / PVC updates
1259
+
1260
+ We considered deleting the Pod while or before updating the PVC, but found several issues:
1261
+ * Admission of a PVC update is fairly complex and can fail for many reasons.
1262
+ We want to make sure the Pod is still running if we cannot update the PVC.
1263
+ * As described in [KEP-5381], we want to allow affinity changes when the VolumeAttributesClass is updated.
1264
+ Updating the PVC and Pod concurrently may trigger a race condition in which the Pod is scheduled to the wrong node.
1265
+
1266
+ The current order (wait for the PVC to be ready before deleting the old Pod) has an extra advantage:
1267
+ When the Pod is ready, the PVC is guaranteed to be ready too.
1268
+ So existing tools that monitor the StatefulSet rollout process do not need to change.
1269
+
1270
+ The downside is that concurrency is lower, so the rolling update may take longer.
1271
+
1272
+ [KEP-5381]: https://github.com/kubernetes/enhancements/blob/0602a5f744b8e4e201d7bd90eb69e67f1b9baf62/keps/sig-storage/5381-mutable-pv-affinity/README.md#notesconstraintscaveats-optional
1243
1273
1244
1274
## Infrastructure Needed (Optional)
1245
1275