Commit a804a04

Address upgrade, resource usage and other related comments
Signed-off-by: Swati Gupta <swatig@nvidia.com>
1 parent 17d63f3 commit a804a04

File tree

1 file changed: +24 -7 lines changed
  • keps/sig-node/3695-pod-resources-for-dra

keps/sig-node/3695-pod-resources-for-dra/README.md

@@ -358,7 +358,12 @@ Kubelet may fail to start. The new API may report inconsistent data, or may caus
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-Not Applicable.
+Not Applicable, because this change:
+
+- Is read-only with respect to the kubelet’s in-memory state.
+- Is behind a feature gate, so turning it off simply disables the new endpoints without affecting any existing behavior.
+
+In practice, restarting the kubelet with the gate disabled (rollback) or re-enabled (upgrade) makes the API behavior revert or return without loss of data or consistency. Therefore we don’t need a special upgrade/downgrade test matrix for this KEP.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
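The rollback argument in this hunk can be illustrated with a small sketch. It is not kubelet code: `PodResourcesServer`, `FeatureGateDisabledError`, and the `allocations` dict are hypothetical stand-ins chosen for illustration. The sketch models a read-only endpoint behind a gate and checks that switching the gate off and on again leaves the underlying state and the responses unchanged.

```python
import copy


class FeatureGateDisabledError(Exception):
    """Hypothetical stand-in for the well-known error returned when the gate is off."""


class PodResourcesServer:
    """Toy model of a read-only endpoint guarded by a feature gate."""

    def __init__(self, allocations, gate_enabled):
        # State owned by other components; this endpoint only reads it.
        self._allocations = allocations
        self.gate_enabled = gate_enabled

    def get_dynamic_resources(self, pod):
        if not self.gate_enabled:
            raise FeatureGateDisabledError("dynamic resources endpoint is disabled")
        # Return a copy so callers cannot mutate the underlying state.
        return copy.deepcopy(self._allocations.get(pod, []))


def toggle_round_trip(allocations, pod):
    """Enable -> disable -> enable, returning the two successful responses."""
    server = PodResourcesServer(allocations, gate_enabled=True)
    before = server.get_dynamic_resources(pod)
    server.gate_enabled = False  # rollback: gate switched off
    try:
        server.get_dynamic_resources(pod)
    except FeatureGateDisabledError:
        pass  # expected while the gate is off
    server.gate_enabled = True  # upgrade: gate switched back on
    after = server.get_dynamic_resources(pod)
    return before, after
```

Because the endpoint never writes to the state it reads, the responses before and after the round trip are expected to match, which is the property the KEP text relies on when it skips a dedicated upgrade/downgrade test matrix.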

@@ -383,7 +388,9 @@ Call the PodResources API and see the result.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-N/A.
+100% availability in normal operation. The proposed API exposes, in read-only mode, kubelet-internal data that is critical for the functioning of the kubelet.
+This data has to be available 100% of the time for the kubelet to function properly, so the API is expected to be available 100% of the time.
+The only possible error source is API calls being throttled by the rate limiting introduced with the GA graduation of the parent KEP 606.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
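Since throttling is named as the only expected error source, a consumer may want to retry throttled calls rather than treat them as failures. The following is a minimal sketch under stated assumptions: `ThrottledError` is a hypothetical exception that the caller's own client wrapper would raise when the kubelet rejects a call due to rate limiting, and the retry parameters are illustrative, not values taken from the KEP.

```python
import time


class ThrottledError(Exception):
    """Hypothetical: raised by the caller's client wrapper when a call is rate limited."""


def call_with_backoff(call, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Invoke call(), retrying with exponential backoff while it is throttled."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # still throttled after the last attempt
            # Delay doubles on each retry: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
```

Any other exception propagates immediately, which matches the guidance elsewhere in this diff that unexpected errors should be treated as bugs rather than retried.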

@@ -419,29 +426,35 @@ No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-No.
+No. Enabling this feature does not change the number of API objects returned, but it may increase the size of each object: whenever there are Dynamic Resources to report, each ContainerResources now has an extra dynamic_resources field.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
 No. Feature is out of existing any paths in kubelet.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+Negligible amount of CPU and memory. Because the API is purely read-only and piggybacks on the kubelet’s existing cache and checkpointing machinery, exposing Dynamic Resources incurs only minimal serialization and storage overhead, similar to that of CPUManager and DeviceManager, so any extra CPU, memory, disk, or I/O impact is negligible.
 
-DDOSing the API can lead to resource exhaustion.
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No, because the endpoint queries existing data structures inside the kubelet.
 
 ### Troubleshooting
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
-N/A.
+No impact; the feature is node-local.
 
 ###### What are other known failure modes?
 
-The API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. Bugs may cause the API to return unexpected errors, or to return inconsistent data. Consumers of the API should treat unexpected errors as bugs of this API.
+Feature gate disabled: the API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available.
+Bugs may cause the API to return unexpected errors, or to return inconsistent data.
+Consumers of the API should treat unexpected errors as bugs of this API.
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
-N/A.
+Check the error code to learn whether the consumer of the API is being throttled by the rate limiting introduced in the parent KEP 606.
+Check the kubelet logs to learn about resource allocation errors.
 
 ## Implementation History
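The size claim in this hunk (same object count, larger objects when Dynamic Resources are present) can be illustrated with a rough sketch. It makes several assumptions: the dict layout is a simplified stand-in for the real `ContainerResources` message, the `claim_name` and `claim_namespace` keys in the usage example are illustrative, and compact JSON is used only as a proxy for the actual wire encoding.

```python
import json


def container_resources(name, cpu_ids, dynamic_resources=None):
    """Build a simplified ContainerResources-like dict (not the real message type)."""
    entry = {"name": name, "cpu_ids": cpu_ids}
    if dynamic_resources:
        # Only present when there are Dynamic Resources to report.
        entry["dynamic_resources"] = dynamic_resources
    return entry


def encoded_size(obj):
    """Size in bytes of a compact JSON encoding (a rough proxy, not protobuf)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
```

For example, `container_resources("app", [0, 1])` and `container_resources("app", [0, 1], [{"claim_name": "gpu-claim", "claim_namespace": "default"}])` each describe one container, but the second encodes to more bytes because of the extra field.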

@@ -453,4 +466,8 @@ N/A.
 
 ## Drawbacks
 
+N/A
+
 ## Alternatives
+
+N/A
