OCPNODE-3201: Default Enablement of system-reserved-compressible in OpenShift #5408
base: main
Conversation
Skipping CI for Draft Pull Request.
ca28d80 to 00bb8e1
@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
}
// Validate that systemReservedCgroup matches systemCgroups if both are set
if kcDecoded.SystemReservedCgroup != "" && kcDecoded.SystemCgroups != "" {
    if kcDecoded.SystemReservedCgroup != kcDecoded.SystemCgroups {
Why should the values of SystemReservedCgroup and SystemCgroups match?
I don't find such a condition in the kubelet configuration docs.
As per https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/, it is recommended that the OS system daemons are placed under a top-level control group (system.slice on systemd machines, for example).
If the two are not the same, enforcement would happen on a different cgroup, while the calculation of the values would happen using SystemCgroups.
Apologies, I'm still unclear on this.
I did some more digging into this. If they were different, the kubelet would move system processes into one cgroup (via SystemCgroups) but enforce the resource reservation on an empty or different cgroup (via SystemReservedCgroup), rendering the weights useless for those processes.
@haircommander Please review
/hold until OCP 4.22
cc: @harche
Keeping in draft state to add an e2e test
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5f3903c0-db3f-11f0-9304-6254ced58bff-0

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/de46c180-dbe8-11f0-9b16-f4843203486f-0

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2,openshift/origin#30644

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c86ee1f0-e497-11f0-8a0b-ff5ebaa0ea08-0

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/121ace70-e797-11f0-9238-0cc470394ced-0

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5fc5bd50-e7ca-11f0-9c72-75e7f7ff3b1b-0
/test all

ab9b00b to 308744e

/test all

308744e to 6abd4be

/test all
@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@ngopalak-redhat: The following tests failed.
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What I did
This PR enables system-reserved-compressible enforcement by default for all OpenShift 4.22+ clusters to allow better CPU allocation for system reserved processes through cgroup-based enforcement.
Template Changes:
- Added systemReservedCgroup: /system.slice to the default kubelet configuration for all node types (master, worker, arbiter)
- Added system-reserved to enforceNodeAllocatable alongside pods in the kubelet template files

Performance Profile Compatibility:
The kubelet cannot simultaneously enforce both systemReservedCgroup and --reserved-cpus (used by Performance Profiles in the Node Tuning Operator). To resolve this conflict, I added logic in the Kubelet Config Controller (pkg/controller/kubelet-config/helpers.go) to:
- Detect when a Performance Profile (i.e. --reserved-cpus) is set
- Clear systemReservedCgroup and reset enforceNodeAllocatable to ["pods"] in this scenario only

This approach leverages the fact that --reserved-cpus already supersedes system-reserved, making systemReservedCgroup enforcement redundant in PerformanceProfile scenarios.
Validation:
- Validate that systemReservedCgroup matches systemCgroups when both are user-specified

How to verify it
For New OCP 4.21+ Clusters:
cat /etc/kubernetes/kubelet.conf | grep -A2 systemReservedCgroup
cat /etc/kubernetes/kubelet.conf | grep -A3 enforceNodeAllocatable
Expected output:
systemReservedCgroup: /system.slice
enforceNodeAllocatable:
For Clusters with Performance Profiles:
cat /etc/kubernetes/kubelet.conf | grep systemReservedCgroup
cat /etc/kubernetes/kubelet.conf | grep enforceNodeAllocatable
- systemReservedCgroup is NOT present (empty/cleared)
- enforceNodeAllocatable only contains ["pods"]
- Kubelet starts successfully without errors
journalctl -u kubelet | grep -iE "system-reserved|reserved-cpus"
Notes from testing
In pkg/controller/kubelet-config/helpers.go, mergo.WithOverwriteWithEmptyValue is not used. Adding mergo.WithOverwriteWithEmptyValue could impact other keys in the kubelet config, so to reduce the blast radius an if condition is added instead.
Tests were conducted to validate CPU usage behavior regarding system.slice weights versus hard limits under various load conditions.
Observation: With system-reserved-compressible enabled (500m limit / weight 20), a process in system.slice consumed a full CPU core (1000m) when other slices were idle.
Conclusion: Validated that CPU weights are not hard limits. As per kernel documentation, slices can burst to use available CPU if there is no contention from other slices.
Test: Simultaneous load applied to system.slice (3 processes) and kubepods.slice (4 processes).
Result: system.slice usage correctly adhered to the configured threshold (did not exceed 500m).
Conclusion: Confirmed that CPU weights correctly enforce proportional distribution when the CPU is under stress.
Test: Auto node sizing applied (2.35 cores reserved). Stressed with 200 processes on kubepods and 50 on system.
Result: Observed usage was ~3.27 cores (calculated weight ~92).
Conclusion: Performance is within an acceptable range of the target reservation.
Documentation Update
The following note regarding the default behavior change should be added:
"By default in OpenShift 4.22 and later, system-reserved-compressible is enabled for all clusters that do not use the reserved CPU feature. This addresses previous issues where the system reserved CPU exceeded the desired limit. This default can be overridden by setting systemReservedCPU to "" in the kubelet configuration. Note: In rare cases where other slices are running CPU-intensive workloads, contention from slices other than system.slice and kubepods.slice may still impact overall CPU allocation."
Description for the changelog
Enable system-reserved-compressible enforcement by default in OCP 4.22+ clusters. The kubelet now enforces CPU limits on system daemons via systemReservedCgroup (/system.slice), improving CPU allocation for system reserved processes on nodes with high CPU counts. Automatically disables systemReservedCgroup enforcement when Performance Profiles with reserved-cpus are used to prevent conflicts.
Related:
Decision Update