Commit d1ded6e

Update kep-4876 Mutable CSINode Allocatable for Beta
1 parent ee25de8 commit d1ded6e

File tree

3 files changed

+142
-38
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
11
kep-number: 4876
22
alpha:
33
approver: "@deads2k"
4+
beta:
5+
approver: "@deads2k"

keps/sig-storage/4876-mutable-csinode-allocatable/README.md

Lines changed: 136 additions & 33 deletions
@@ -18,8 +18,12 @@
1818
- [API Changes](#api-changes)
1919
- [CSINode](#csinode)
2020
- [CSIDriver](#csidriver)
21+
- [VolumeError](#volumeerror)
2122
- [Validation Changes](#validation-changes)
22-
- [Volume Plugin Manager](#volume-plugin-manager)
23+
- [CSI Node Updater](#csi-node-updater)
24+
- [Implementation details](#implementation-details)
25+
- [Update behavior](#update-behavior)
26+
- [Error handling](#error-handling)
2327
- [NodeInfoManager Interface Extension](#nodeinfomanager-interface-extension)
2428
- [CSINode Update Behavior](#csinode-update-behavior)
2529
- [Pod Construction Changes](#pod-construction-changes)
@@ -186,21 +190,45 @@ type VolumeNodeResources struct {
186190

187191
#### CSIDriver
188192

189-
A new field, `NodeAllocatableUpdatePeriodSeconds`, will be added to the `CSIDriverSpec` struct. This field allows a CSI driver to specify the interval at which the Kubelet should periodically query a driver's `NodeGetInfo` RPC endpoint to update the `CSINode` object. If this field is not set, updates will only occur in response to volume attachment failures as a result of no capacity.
193+
A new field, `NodeAllocatableUpdatePeriodSeconds`, will be added to the `CSIDriverSpec` struct. This field allows a CSI driver to specify the interval at which the Kubelet should periodically query a driver's `NodeGetInfo` RPC endpoint to update the `CSINode` object. If this field is not set, no updates occur (neither periodic nor upon detecting capacity-related failures), and the allocatable count remains static.
190194

191195
```golang
192196
// CSIDriverSpec is the specification of a CSIDriver.
193197
type CSIDriverSpec struct {
194198
...
195-
// NodeAllocatableUpdatePeriodSeconds specifies the interval between periodic updates of
196-
// the CSINode allocatable capacity for this driver. If not set, periodic updates
197-
// are disabled, and updates occur only upon detecting capacity-related failures.
198-
// The minimum allowed value for this field is 10 seconds.
199-
// +optional
199+
// nodeAllocatableUpdatePeriodSeconds specifies the interval between periodic updates of
200+
// the CSINode allocatable capacity for this driver. When set, both periodic updates and
201+
// updates triggered by capacity-related failures are enabled. If not set, no updates
202+
// occur (neither periodic nor upon detecting capacity-related failures), and the
203+
// allocatable.count remains static. The minimum allowed value for this field is 10 seconds.
204+
//
205+
//
206+
// This field is mutable.
207+
//
208+
// +featureGate=MutableCSINodeAllocatableCount
209+
// +optional
200210
NodeAllocatableUpdatePeriodSeconds *int64
201211
}
202212
```
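
The constraints described in the field comment can be sketched as a small validation helper. This is only an illustration under stated assumptions: `validateUpdatePeriod` and `minUpdatePeriodSeconds` are hypothetical names, not the actual API validation code.

```go
package main

import "fmt"

// minUpdatePeriodSeconds mirrors the 10-second minimum stated in the field comment.
const minUpdatePeriodSeconds int64 = 10

// validateUpdatePeriod is a hypothetical helper, not the real API validation.
// A nil period is valid and means that no updates occur.
func validateUpdatePeriod(period *int64) error {
	if period == nil {
		return nil
	}
	if *period < minUpdatePeriodSeconds {
		return fmt.Errorf("nodeAllocatableUpdatePeriodSeconds must be at least %d, got %d",
			minUpdatePeriodSeconds, *period)
	}
	return nil
}

func main() {
	for _, v := range []int64{5, 10, 60} {
		p := v
		fmt.Printf("period=%ds valid=%t\n", v, validateUpdatePeriod(&p) == nil)
	}
	fmt.Printf("period=unset valid=%t\n", validateUpdatePeriod(nil) == nil)
}
```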
203213

214+
#### VolumeError
215+
216+
A new field, `ErrorCode`, will be added to the `VolumeError` struct to facilitate detection of capacity-related errors:
217+
218+
```golang
219+
// Captures an error encountered during a volume operation.
220+
type VolumeError struct {
221+
...
222+
// errorCode is a numeric gRPC code representing the error encountered during Attach or Detach operations.
223+
//
224+
// This is an optional field that can be set only when the MutableCSINodeAllocatableCount feature gate is enabled.
225+
//
226+
// +featureGate=MutableCSINodeAllocatableCount
227+
// +optional
228+
ErrorCode *int32
229+
}
230+
```
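
A minimal sketch of how a consumer might use `ErrorCode` to detect a capacity-related failure. The `volumeError` type below is a trimmed stand-in for the API type, and `isResourceExhausted` is a hypothetical helper. The value 8 is the numeric gRPC status code for `ResourceExhausted`, redeclared locally so the sketch has no external dependencies.

```go
package main

import "fmt"

// grpcResourceExhausted is the numeric gRPC status code for ResourceExhausted.
const grpcResourceExhausted int32 = 8

// volumeError is a trimmed, illustrative stand-in for the VolumeError API type.
type volumeError struct {
	Message   string
	ErrorCode *int32
}

// isResourceExhausted is a hypothetical helper. It reports whether the error
// carries a code and that code is ResourceExhausted. A nil code (for example,
// when the feature gate is disabled) is never treated as a capacity error.
func isResourceExhausted(e *volumeError) bool {
	return e != nil && e.ErrorCode != nil && *e.ErrorCode == grpcResourceExhausted
}

func main() {
	code := grpcResourceExhausted
	withCode := &volumeError{Message: "attachment limit reached", ErrorCode: &code}
	withoutCode := &volumeError{Message: "no code recorded"}

	fmt.Println(isResourceExhausted(withCode))
	fmt.Println(isResourceExhausted(withoutCode))
}
```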
231+
204232
#### Validation Changes
205233

206234
The [ValidateCSINodeUpdate](https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/storage/validation/validation.go#L304) function in the API validation code path will be modified to allow updates to the `Allocatable.Count`
@@ -226,20 +254,53 @@ func ValidateCSINodeUpdate(new, old *storage.CSINode) field.ErrorList {
226254

227255
This updated logic allows the `Allocatable.Count` field to be modified when the feature gate is enabled, while ensuring all other fields remain immutable. When the feature gate is disabled, it falls back to the existing validation logic for backward compatibility.
228256

229-
#### Volume Plugin Manager
257+
#### CSI Node Updater
258+
259+
A new plugin-level updater will be implemented in `kubernetes/pkg/volume/csi/csi_node_updater.go` to manage periodic updates of CSINode allocatable counts. This updater watches for changes to CSIDriver objects and manages per-driver update goroutines based on the `NodeAllocatableUpdatePeriodSeconds` setting.
230260

231-
A new goroutine will be started in VolumePluginMgr’s [Run()](https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/plugins.go#L953) func if the `NodeAllocatableUpdatePeriodSeconds` is set to a nonzero value. This goroutine will periodically trigger updates to the `CSINode` object based on the specified interval:
261+
##### Implementation details
232262

233263
```golang
234-
func (pm *VolumePluginMgr) Run(stopCh <-chan struct{}) {
235-
if pm.csiNodeUpdateInterval > 0 {
236-
go wait.Until(pm.updateCSINodeInfo, pm.csiNodeUpdateInterval, stopCh)
264+
// csiNodeUpdater watches for changes to CSIDriver objects and manages the lifecycle
265+
// of per-driver goroutines that periodically update CSINodeDriver.Allocatable information
266+
type csiNodeUpdater struct {
267+
// Informer for CSIDriver objects
268+
driverInformer cache.SharedIndexInformer
269+
270+
// Map of driver names to stop channels for update goroutines
271+
driverUpdaters sync.Map
272+
273+
// Ensures the updater is only started once
274+
once sync.Once
275+
}
276+
```
277+
##### Update behavior
278+
279+
When a `CSIDriver` object is added or updated with `NodeAllocatableUpdatePeriodSeconds` set, the updater checks if the driver is installed on the node before running periodic updates.
280+
281+
When `NodeAllocatableUpdatePeriodSeconds` is modified, the updater automatically adjusts by stopping the old goroutine and starting a new one. Setting the period to 0 or nil stops updates entirely. Driver uninstallation or `CSIDriver` object deletion also stops the update goroutine for that specific driver.
282+
283+
```golang
284+
func (u *csiNodeUpdater) runPeriodicUpdate(driverName string, period time.Duration, stopCh <-chan struct{}) {
285+
ticker := time.NewTicker(period)
286+
defer ticker.Stop()
287+
288+
for {
289+
select {
290+
case <-ticker.C:
291+
if err := updateCSIDriver(driverName); err != nil {
292+
klog.ErrorS(err, "Failed to update CSIDriver", "driver", driverName)
293+
}
294+
case <-stopCh:
295+
return
296+
}
237297
}
238298
}
239299
```
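
The stop-and-restart behavior described above can be sketched as follows. This is an illustration, not the proposed implementation: `driverUpdaters`, `reconcile`, and `running` are hypothetical names, and a mutex-guarded map stands in for the `sync.Map` to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// driverUpdaters is a hypothetical registry of per-driver stop channels.
type driverUpdaters struct {
	mu    sync.Mutex
	stops map[string]chan struct{}
}

func newDriverUpdaters() *driverUpdaters {
	return &driverUpdaters{stops: make(map[string]chan struct{})}
}

// reconcile stops any goroutine already running for the driver and, if the
// period is positive, starts a new one that calls update on every tick.
// A non-positive period (field unset or cleared) only stops the old goroutine.
func (d *driverUpdaters) reconcile(driver string, period time.Duration, update func(string)) {
	d.mu.Lock()
	defer d.mu.Unlock()

	if stop, ok := d.stops[driver]; ok {
		close(stop)
		delete(d.stops, driver)
	}
	if period <= 0 {
		return
	}

	stop := make(chan struct{})
	d.stops[driver] = stop
	go func() {
		ticker := time.NewTicker(period)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				update(driver)
			case <-stop:
				return
			}
		}
	}()
}

// running reports whether an update goroutine is registered for the driver.
func (d *driverUpdaters) running(driver string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	_, ok := d.stops[driver]
	return ok
}

func main() {
	u := newDriverUpdaters()
	noop := func(string) {}

	u.reconcile("example.csi.driver", 30*time.Second, noop) // period set: goroutine starts
	fmt.Println(u.running("example.csi.driver"))

	u.reconcile("example.csi.driver", 0, noop) // period cleared: goroutine stops
	fmt.Println(u.running("example.csi.driver"))
}
```

Closing the stop channel before replacing it keeps at most one goroutine per driver, which matches the per-driver lifecycle described for `NodeAllocatableUpdatePeriodSeconds` changes.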
240300

241-
In case of a failure during the `updateCSINodeInfo` call, the `Allocatable.Count` will retain its current value and `updateCSINodeInfo` will be retried.
301+
##### Error handling
242302

303+
If `updateCSIDriver()` fails, the error is logged but the allocatable count retains its current value. Updates continue at the configured interval regardless of individual failures.
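
A minimal sketch of the "retain the current value on failure" rule. The helper `refreshCount` and the error `errDriverUnavailable` are hypothetical. The query function stands in for a call to the driver's `NodeGetInfo` RPC.

```go
package main

import (
	"errors"
	"fmt"
)

// errDriverUnavailable is an illustrative error used to simulate a failed query.
var errDriverUnavailable = errors.New("driver unavailable")

// refreshCount is a hypothetical helper. It returns the newly reported count on
// success and the previous count, together with the error, when the query fails.
func refreshCount(current int32, query func() (int32, error)) (int32, error) {
	next, err := query()
	if err != nil {
		return current, err
	}
	return next, nil
}

func main() {
	count := int32(25)

	// Failed query: the count is left unchanged and the error is surfaced for logging.
	count, err := refreshCount(count, func() (int32, error) { return 0, errDriverUnavailable })
	fmt.Println(count, err)

	// Successful query on the next tick: the count is replaced.
	count, err = refreshCount(count, func() (int32, error) { return 39, nil })
	fmt.Println(count, err)
}
```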
243304

244305
#### NodeInfoManager Interface Extension
245306

@@ -262,7 +323,7 @@ This table explains how updates to the `CSINode.Spec.Drivers[*].Allocatable.Coun
262323
| **Feature Flag Status** | **`NodeAllocatableUpdatePeriodSeconds`** | **Behavior** |
263324
|------------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
264325
| Enabled | Set | Periodic updates occur at the defined interval + when invalid state is detected (volume attachment failures due to `ResourceExhausted`)|
265-
| Enabled | Not set | Updates occur only in response to volume attachment failures (`ResourceExhausted` errors) |
326+
| Enabled | Not set | No updates occur; `Allocatable.Count` remains static |
266327
| Disabled | Set | `NodeAllocatableUpdatePeriodSeconds` is ignored; `Allocatable.Count` remains static and immutable |
267328
| Disabled | Not set | No updates occur; `Allocatable.Count` remains static and immutable |
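
The table above can also be read as a small decision function. The sketch below is illustrative only: `behaviorFor` and the `updateBehavior` constants are hypothetical names that restate the four rows.

```go
package main

import "fmt"

// updateBehavior names the outcomes listed in the table.
type updateBehavior string

const (
	periodicAndReactive updateBehavior = "periodic updates plus updates on ResourceExhausted failures"
	staticMutable       updateBehavior = "no updates occur; count remains static"
	staticImmutable     updateBehavior = "no updates occur; count remains static and immutable"
)

// behaviorFor is a hypothetical helper mapping the feature gate state and the
// nodeAllocatableUpdatePeriodSeconds field to the behavior in the table.
func behaviorFor(gateEnabled bool, period *int64) updateBehavior {
	switch {
	case !gateEnabled:
		// The field is ignored when the gate is disabled, whether set or not.
		return staticImmutable
	case period == nil:
		return staticMutable
	default:
		return periodicAndReactive
	}
}

func main() {
	period := int64(60)
	fmt.Println(behaviorFor(true, &period))
	fmt.Println(behaviorFor(true, nil))
	fmt.Println(behaviorFor(false, &period))
	fmt.Println(behaviorFor(false, nil))
}
```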
268329

@@ -271,7 +332,7 @@ This table explains how updates to the `CSINode.Spec.Drivers[*].Allocatable.Coun
271332

272333
To address race conditions where the scheduler assigns stateful pods to nodes with insufficient capacity, Kubelet's pod construction process during [WaitForAttachAndMount](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/volumemanager/volume_manager.go#L393) will now handle `ResourceExhausted` errors returned by CSI drivers during the `ControllerPublishVolume` RPC.
273334

274-
The `ResourceExhausted` error is directly reported on the `VolumeAttachment` object associated with the relevant attachment. To facilitate easier detection of `ResourceExhausted` errors from `VolumeAttachment` statuses, we propose adding a `StatusCode` field to the [VolumeError](https://github.com/kubernetes/api/blob/master/storage/v1/types.go#L219) struct.
335+
The `ResourceExhausted` error is directly reported on the `VolumeAttachment` object associated with the relevant attachment. To facilitate easier detection of `ResourceExhausted` errors from `VolumeAttachment` statuses, we propose adding an `ErrorCode` field to the [VolumeError](https://github.com/kubernetes/api/blob/master/storage/v1/types.go#L219) struct.
275336

276337
```golang
277338
if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
@@ -395,7 +456,6 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
395456

396457
#### Beta
397458

398-
- Allowing time for feedback (at least 2 releases between beta and GA).
399459
- All unit tests/integration/e2e tests completed and enabled.
400460
- Validate kubelet behavior when API server rejects `CSINode` updates (older API server version).
401461
- Validate CSI driver behavior with and without the `NodeAllocatableUpdatePeriodSeconds` field set.
@@ -405,8 +465,7 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
405465

406466
#### GA
407467

408-
- All beta criteria have been satisfied.
409-
- Feature is stable.
468+
- Feature stability: at least 2 releases between Beta and GA.
410469
- No bug reports / feedback / improvements to address.
411470

412471
### Upgrade / Downgrade Strategy
@@ -490,7 +549,7 @@ well as the [existing list] of feature gates.
490549

491550
- [X] Feature gate (also fill in values in `kep.yaml`)
492551
- Feature gate name: `MutableCSINodeAllocatableCount`
493-
- Components depending on the feature gate: `kube-apiserver`, `kube-controller-manager`, `kubelet`.
552+
- Components depending on the feature gate: `kube-apiserver`, `kubelet`.
494553

495554
###### Does enabling the feature change any default behavior?
496555

@@ -556,13 +615,28 @@ rollout. Similarly, consider large clusters and how enablement/disablement
556615
will rollout across nodes.
557616
-->
558617

618+
The rollout or rollback of this feature is designed such that it cannot fail in a way that impacts cluster operation.
619+
620+
During rollout, if the API server or Kubelet does not support the feature, or if there is a version mismatch, update attempts to `CSINode.Allocatable` will fail gracefully, maintaining the existing value. This ensures that the worst-case scenario is simply a continuation of the current behavior, rather than a failure state.
621+
622+
For rollback, disabling the feature gate will immediately stop any updates to the allocatable property. Kubernetes will continue using the last known value, which may be outdated but won't cause operational issues.
623+
624+
In essence, the feature's best-effort nature and feature gate protection make it resilient against rollout or rollback failures. The primary risk is temporary inconsistency in reported capacities during transition periods, but this does not impact running workloads or overall cluster stability.
625+
559626
###### What specific metrics should inform a rollback?
560627

561628
<!--
562629
What signals should users be paying attention to when the feature is young
563630
that might indicate a serious problem?
564631
-->
565632

633+
Since this feature implements best-effort updates to `CSINode.Allocatable`, the only signals that would necessitate a rollback are:
634+
635+
- Unexpected kubelet crashes after enabling the feature.
636+
- API server crashes related to CSINode updates.
637+
638+
In both cases, component crashes would be evident through standard monitoring of node and control plane health. Outside of these scenarios, there are no specific metrics that would require rolling back this feature, as failed updates simply maintain existing values.
639+
566640
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
567641

568642
<!--
@@ -571,12 +645,22 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
571645
are missing a bunch of machinery and tooling and can't do that now.
572646
-->
573647

648+
Yes, the following test scenarios were validated in the Alpha release:
649+
650+
- Upgrade path: API server and Kubelet upgrades were tested with the feature gate enabled, confirming that CSINode updates begin working once both components support the feature.
651+
652+
- Downgrade path: Confirmed that, when the feature gate is disabled or components are downgraded, `CSINode.Allocatable` remains at its last value and becomes immutable again.
653+
654+
- upgrade->downgrade->upgrade path: Verified that the full cycle works as expected, with CSINode updates resuming when the feature is re-enabled without requiring additional configuration.
655+
574656
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
575657

576658
<!--
577659
Even if applying deprecation policies, they may still surprise some users.
578660
-->
579661

662+
No.
663+
580664
### Monitoring Requirements
581665

582666
<!--
@@ -594,6 +678,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
594678
logs or events for this purpose.
595679
-->
596680

681+
An operator can determine if this feature is in use by checking the CSIDriver objects in their cluster for the `nodeAllocatableUpdatePeriodSeconds` field. If this field is set on a CSI driver, the feature is being used. This is similar to how operators check for other CSI capabilities through fields in the CSIDriver object, such as `fsGroupPolicy` or `podInfoOnMount`.
682+
597683
###### How can someone using this feature know that it is working for their instance?
598684

599685
<!--
@@ -605,13 +691,9 @@ and operation of this feature.
605691
Recall that end users cannot usually observe component logs or access metrics.
606692
-->
607693

608-
- [ ] Events
609-
- Event Reason:
610-
- [ ] API .status
611-
- Condition name:
612-
- Other field:
613-
- [ ] Other (treat as last resort)
614-
- Details:
694+
- [X] API .status
695+
- `VolumeAttachment.Status.Errors[].ErrorCode` will be populated with the gRPC error code when a `ResourceExhausted` error occurs during a driver's `ControllerPublishVolume` RPC.
696+
- `CSINode.Spec.Drivers[*].Allocatable.Count` will be updated periodically based on the `nodeAllocatableUpdatePeriodSeconds` configuration in the CSIDriver object.
615697

616698
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
617699

@@ -630,18 +712,19 @@ These goals will help you determine what you need to measure (SLIs) in the next
630712
question.
631713
-->
632714

715+
For this enhancement, the following SLOs are reasonable:
716+
717+
- 99.9% of CSINode updates (both periodic and reactive) should complete within 1 second of being triggered.
718+
- The introduction of this feature should not increase the overall API server error rate (5xx errors) by more than 0.1%.
719+
- No measurable impact on pod startup latency, as CSINode updates are performed asynchronously.
720+
633721
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
634722

635723
<!--
636724
Pick one more of these and delete the rest.
637725
-->
638726

639-
- [ ] Metrics
640-
- Metric name:
641-
- [Optional] Aggregation method:
642-
- Components exposing the metric:
643-
- [ ] Other (treat as last resort)
644-
- Details:
727+
Not applicable. The feature operates in a best-effort manner: either `CSINode.Spec.Drivers[*].Allocatable` is updated or it keeps its existing value. Standard API server and kubelet health metrics are sufficient to monitor overall cluster health.
645728

646729
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
647730

@@ -650,6 +733,12 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
650733
implementation difficulties, etc.).
651734
-->
652735

736+
The following metrics could provide additional visibility into the feature's operation. They were not added because API server health metrics already indirectly measure the success of CSINode updates: if the API server is healthy, updates are expected to succeed.
737+
738+
- `csi_node_updates_total`: Could track `CSINode.Spec.Drivers[*].Allocatable` updates attempted (periodic/reactive).
739+
- `csi_node_update_errors_total`: Could track failed update attempts.
740+
- `csi_node_update_duration_seconds`: Could track update latency.
741+
653742
### Dependencies
654743

655744
<!--
@@ -673,6 +762,10 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
673762
- Impact of its degraded performance or high-error rates on the feature:
674763
-->
675764

765+
This feature primarily depends on CSI drivers implementing the `NodeGetInfo` RPC to report volume attachment limits. If a CSI driver is unavailable, the `CSINode.Spec.Drivers[*].Allocatable` value remains at its last known value. Degraded performance or high error rates in CSI drivers may cause periodic or reactive updates to fail, but this only results in using the last known value, with no impact on existing workloads.
766+
767+
Beyond CSI drivers, which are already a requirement for volume operations, this feature introduces no additional service dependencies. It builds upon existing Kubernetes components (kubelet and API server) and their normal operation.
768+
676769
### Scalability
677770

678771
<!--
@@ -705,7 +798,7 @@ Yes, there will be new API calls to update the `CSINode` object:
705798
```
706799
API call type: PATCH
707800
Estimated throughput: Depends on the `NodeAllocatableUpdatePeriodSeconds` setting and the frequency of volume attachment failures.
708-
Originating component: Kubelet, KCM
801+
Originating component: Kubelet
709802
```
710803

711804
###### Will enabling / using this feature result in introducing new API types?
@@ -800,6 +893,8 @@ details). For now, we leave it here.
800893

801894
###### How does this feature react if the API server and/or etcd is unavailable?
802895

896+
When the API server is unavailable, `CSINode` update attempts fail and are logged; however, the periodic update goroutines continue running and retry at their configured intervals. Additionally, `ResourceExhausted` errors cannot trigger immediate updates since `VolumeAttachment` statuses cannot be read. Existing allocatable values remain unchanged and stateful workloads continue running normally.
897+
803898
###### What are other known failure modes?
804899

805900
<!--
@@ -815,8 +910,12 @@ For each of them, fill in the following information by copying the below templat
815910
- Testing: Are there any tests for failure mode? If not, describe why.
816911
-->
817912

913+
No other known failure modes.
914+
818915
###### What steps should be taken if SLOs are not being met to determine the problem?
819916

917+
N/A
918+
820919
## Implementation History
821920

822921
<!--
@@ -830,6 +929,10 @@ Major milestones might include:
830929
- when the KEP was retired or superseded
831930
-->
832931

932+
- 2024-08-08 - Enhancement proposed in sig-storage.
933+
- 2024-09-25 - Enhancement officially submitted to Kubernetes.
934+
- 2025-04-23 - Kubernetes v1.33: Enhancement implemented and released in Alpha.
935+
833936
## Drawbacks
834937

835938
<!--
