Conversation

@ngopalak-redhat commented Nov 12, 2025

What I did

This PR enables system-reserved-compressible enforcement by default for all OpenShift 4.22+ clusters to allow better CPU allocation for system reserved processes through cgroup-based enforcement.

Template Changes:

  • Added systemReservedCgroup: /system.slice to default kubelet configuration for all node types (master, worker, arbiter)
  • Added system-reserved-compressible to enforceNodeAllocatable alongside pods in kubelet template files

Performance Profile Compatibility:
The kubelet cannot simultaneously enforce both systemReservedCgroup and --reserved-cpus (used by Performance Profiles in the Node Tuning Operator). To resolve this conflict, I added logic in the Kubelet Config Controller (pkg/controller/kubelet-config/helpers.go) to:

  • Detect when reservedSystemCPUs (--reserved-cpus) is set
  • Automatically clear systemReservedCgroup when reservedSystemCPUs is detected
  • Set enforceNodeAllocatable to ["pods"] only in this scenario
  • Preserve existing Performance Profile behavior without requiring any operator changes

This approach leverages the fact that --reserved-cpus already supersedes system-reserved, making systemReservedCgroup enforcement redundant in PerformanceProfile scenarios.
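
A minimal Go sketch of this precedence logic (the types here are simplified stand-ins, not the actual upstream KubeletConfiguration struct used in pkg/controller/kubelet-config/helpers.go):

    package main

    import "fmt"

    // kubeletConfig is a simplified stand-in for the fields of the upstream
    // KubeletConfiguration type that matter here.
    type kubeletConfig struct {
        ReservedSystemCPUs     string
        SystemReservedCgroup   string
        EnforceNodeAllocatable []string
    }

    // reconcileReservedCPUs clears systemReservedCgroup enforcement when
    // reserved-cpus is in use, since --reserved-cpus supersedes system-reserved.
    func reconcileReservedCPUs(cfg *kubeletConfig) {
        if cfg.ReservedSystemCPUs == "" {
            return // no Performance Profile: keep cgroup-based enforcement
        }
        cfg.SystemReservedCgroup = ""
        cfg.EnforceNodeAllocatable = []string{"pods"}
    }

    func main() {
        cfg := &kubeletConfig{
            ReservedSystemCPUs:     "0-3", // as set by a Performance Profile
            SystemReservedCgroup:   "/system.slice",
            EnforceNodeAllocatable: []string{"pods", "system-reserved-compressible"},
        }
        reconcileReservedCPUs(cfg)
        fmt.Printf("%+v\n", cfg) // cgroup cleared, only "pods" enforced
    }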

Validation:

  • Added validation to ensure systemReservedCgroup matches systemCgroups when both are user-specified

How to verify it

For New OCP 4.22+ Clusters:

  1. Deploy a new OCP 4.22+ cluster
  2. SSH into a node and verify kubelet configuration:
    grep -A2 systemReservedCgroup /etc/kubernetes/kubelet.conf
    grep -A3 enforceNodeAllocatable /etc/kubernetes/kubelet.conf
  3. Verify the output shows:
    systemReservedCgroup: /system.slice
    enforceNodeAllocatable:
    - pods
    - system-reserved-compressible

For Clusters with Performance Profiles:

  1. Create a Performance Profile with reservedSystemCPUs set (via Node Tuning Operator)
  2. Wait for the MachineConfig to be applied and nodes to reboot
  3. SSH into the affected node and check kubelet configuration:
    grep systemReservedCgroup /etc/kubernetes/kubelet.conf
    grep enforceNodeAllocatable /etc/kubernetes/kubelet.conf
  4. Verify that:
    - systemReservedCgroup is NOT present (empty/cleared)
    - enforceNodeAllocatable only contains ["pods"]
    - Kubelet starts successfully without errors
  5. Check kubelet logs to confirm no conflicts:
    journalctl -u kubelet | grep -iE "system-reserved|reserved-cpus"

Notes from testing

  • When setting an empty string for systemReservedCgroup, the following line in pkg/controller/kubelet-config/helpers.go:
    err = mergo.Merge(originalKubeConfig, specKubeletConfig, mergo.WithOverride)
    
    ignores it, because mergo.WithOverwriteWithEmptyValue is not used. Switching the merge to mergo.WithOverwriteWithEmptyValue could affect other keys in the kubeletconfig, so to reduce the blast radius a dedicated check was added (a minimal sketch of this merge behavior follows these notes):
     	if specKubeletConfig.SystemReservedCgroup == "" {
     		// explicitly honor the user's empty value (body elided in this excerpt)
     	}
    
  • Adding a new e2e test would significantly increase the overall test-suite duration, so an existing test case was enhanced instead. As discussed in https://redhat-internal.slack.com/archives/CK1AE4ZCK/p1765210654986779, I have only added a high-level kubeletconfig test. We are in the process of defining a new test suite for testing other capabilities.
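
A minimal, runnable sketch of the merge behavior described in the first note, using the mergo library directly (the struct is a stand-in for the real kubeletconfig type):

    package main

    import (
        "fmt"

        "github.com/imdario/mergo" // also published as dario.cat/mergo
    )

    type kubeletConfig struct {
        SystemReservedCgroup string
        CgroupDriver         string
    }

    func main() {
        original := kubeletConfig{SystemReservedCgroup: "/system.slice", CgroupDriver: "systemd"}
        spec := kubeletConfig{SystemReservedCgroup: ""} // user explicitly opts out

        // mergo.WithOverride skips zero values, so the empty string does NOT
        // overwrite the default inherited from the template.
        _ = mergo.Merge(&original, spec, mergo.WithOverride)
        fmt.Println(original.SystemReservedCgroup) // still "/system.slice"

        // Narrow workaround: honor the explicit opt-out without switching the
        // whole merge to mergo.WithOverwriteWithEmptyValue.
        if spec.SystemReservedCgroup == "" {
            original.SystemReservedCgroup = ""
        }
        fmt.Println(original.SystemReservedCgroup) // now cleared
    }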

Stress testing
Tests were conducted to validate how system.slice CPU weights behave, as opposed to hard limits, under various load conditions.

  1. Behavior without Contention

Observation: With system-reserved-compressible enabled (500m limit / weight 20), a process in system.slice consumed a full CPU core (1000m) when other slices were idle.

Conclusion: Validated that CPU weights are not hard limits. As per kernel documentation, slices can burst to use available CPU if there is no contention from other slices.

  2. Behavior with Contention (4-core Node)

Test: Simultaneous load applied to system.slice (3 processes) and kubepods.slice (4 processes).

Result: system.slice usage correctly adhered to the configured threshold (did not exceed 500m).

Conclusion: Confirmed that CPU weights correctly enforce proportional distribution when the CPU is under stress.

  3. Large Scale Behavior (192-core Node)

Test: Auto node sizing applied (2.35 cores reserved). Stressed with 200 processes on kubepods and 50 on system.

Result: Observed usage was ~3.27 cores (calculated weight ~92).

Conclusion: Performance is within an acceptable range of the target reservation.
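
For reference, the weight figures above follow from the kubelet's standard cgroup arithmetic: millicores are converted to cgroup v1 cpu.shares, and shares are then mapped onto cgroup v2 cpu.weight. A small Go sketch (the formulas mirror the kubelet's cgroup manager; treat the exact rounding as an assumption):

    package main

    import "fmt"

    // milliCPUToShares mirrors the kubelet's millicores -> cgroup v1 cpu.shares computation.
    func milliCPUToShares(milliCPU int64) int64 {
        return milliCPU * 1024 / 1000
    }

    // sharesToWeight mirrors the kubelet's mapping of cgroup v1 shares
    // [2, 262144] onto cgroup v2 cpu.weight [1, 10000].
    func sharesToWeight(shares int64) int64 {
        return 1 + ((shares-2)*9999)/262142
    }

    func main() {
        // 500m reserved -> weight 20, matching the 4-core test above.
        fmt.Println(sharesToWeight(milliCPUToShares(500))) // 20
        // 2.35 cores (2350m) reserved -> weight 92, matching the 192-core test.
        fmt.Println(sharesToWeight(milliCPUToShares(2350))) // 92
    }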

Documentation Update
The following note regarding the default behavior change should be added:

"By default in OpenShift 4.22 and later, system-reserved-compressible is enabled for all clusters that do not use the reserved CPU feature. This addresses previous issues where the system reserved CPU exceeded the desired limit. This default can be overridden by setting systemReservedCPU to "" in the kubelet configuration. Note: In rare cases where other slices are running CPU-intensive workloads, contention from slices other than system.slice and kubepods.slice may still impact overall CPU allocation."

Description for the changelog

Enable system-reserved-compressible enforcement by default in OCP 4.22+ clusters. The kubelet now enforces CPU limits on system daemons via systemReservedCgroup (/system.slice), improving CPU allocation for system reserved processes on nodes with high CPU counts. Automatically disables systemReservedCgroup enforcement when Performance Profiles with reserved-cpus are used to prevent conflicts.


Related:

Decision Update

  • As per the latest discussion, we plan to make this the default in OCP 4.22. Clusters upgraded from 4.20 will also have this enabled. The changes required to manage backward compatibility involve more than just a MachineConfig.

@ngopalak-redhat changed the title from "Implement system-reserved-compressible" to "WIP: Implement system-reserved-compressible" on Nov 12, 2025
openshift-ci bot added the do-not-merge/work-in-progress label on Nov 12, 2025
openshift-ci bot commented Nov 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ngopalak-redhat force-pushed the ngopalak/system-reserved-compressible-1 branch from ca28d80 to 00bb8e1 on November 17, 2025 03:53
@ngopalak-redhat changed the title from "WIP: Implement system-reserved-compressible" to "OCPNODE-3201: Default Enablement of system-reserved-compressible in OpenShift 4.21" on Nov 19, 2025
openshift-ci-robot commented Nov 19, 2025

@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.


In response to this:

TODO: Before Review

  • Complete upgrade testing

What I did

This PR enables system-reserved-compressible enforcement by default for all new OpenShift 4.21+ clusters to allow better CPU allocation for system reserved processes through cgroup-based enforcement.

Template Changes:

  • Added systemReservedCgroup: /system.slice to default kubelet configuration for all node types (master, worker, arbiter)
  • Added system-reserved-compressible to enforceNodeAllocatable alongside pods in kubelet template files

Performance Profile Compatibility:
The kubelet cannot simultaneously enforce both systemReservedCgroup and --reserved-cpus (used by Performance Profiles in the Node Tuning Operator). To resolve this conflict, I added logic in the Kubelet Config Controller (pkg/controller/kubelet-config/helpers.go) to:

  • Detect when reservedSystemCPUs (--reserved-cpus) is set
  • Automatically clear systemReservedCgroup when reservedSystemCPUs is detected
  • Set enforceNodeAllocatable to ["pods"] only in this scenario
  • Preserve existing Performance Profile behavior without requiring any operator changes

This approach leverages the fact that --reserved-cpus already supersedes system-reserved, making systemReservedCgroup enforcement redundant in PerformanceProfile scenarios.

Validation:

  • Added validation to ensure systemReservedCgroup matches systemCgroups when both are user-specified

How to verify it

For New OCP 4.21+ Clusters:

  1. Deploy a new OCP 4.21+ cluster
  2. SSH into a node and verify kubelet configuration:
    grep -A2 systemReservedCgroup /etc/kubernetes/kubelet.conf
    grep -A3 enforceNodeAllocatable /etc/kubernetes/kubelet.conf
  3. Verify the output shows:
    systemReservedCgroup: /system.slice
    enforceNodeAllocatable:
    - pods
    - system-reserved-compressible

For Clusters with Performance Profiles:

  1. Create a Performance Profile with reservedSystemCPUs set (via Node Tuning Operator)
  2. Wait for the MachineConfig to be applied and nodes to reboot
  3. SSH into the affected node and check kubelet configuration:
    cat /etc/kubernetes/kubelet.conf | grep systemReservedCgroup
    cat /etc/kubernetes/kubelet.conf | grep enforceNodeAllocatable
  4. Verify that:
  • systemReservedCgroup is NOT present (empty/cleared)
  • enforceNodeAllocatable only contains ["pods"]
  • Kubelet starts successfully without errors
  5. Check kubelet logs to confirm no conflicts:
    journalctl -u kubelet | grep -iE "system-reserved|reserved-cpus"

For OCP 4.20 to 4.21 Upgrades:

  1. Verify that the migration MachineConfig from PR "WIP: [release-4.20] kubelet-config compressible patch" (#5412) is present and preserves the old behavior
  2. Confirm no unexpected node reboots occur during upgrade

Description for the changelog

Enable system-reserved-compressible enforcement by default in new OCP 4.21+ clusters. The kubelet now enforces CPU limits on system daemons via systemReservedCgroup (/system.slice), improving CPU allocation for system reserved processes on nodes with high CPU counts. Automatically disables systemReservedCgroup enforcement when Performance Profiles with reserved-cpus are used to prevent conflicts. Existing OCP 4.20 clusters upgrading to 4.21+ will preserve their current behavior via migration MachineConfig.


Related:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/valid-reference label on Nov 19, 2025
openshift-ci-robot commented Nov 20, 2025

@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.


@ngopalak-redhat marked this pull request as ready for review on November 20, 2025 00:48
openshift-ci bot removed the do-not-merge/work-in-progress label on Nov 20, 2025
@ngopalak-redhat

cc: @MarSik @ffromani

}
// Validate that systemReservedCgroup matches systemCgroups if both are set
if kcDecoded.SystemReservedCgroup != "" && kcDecoded.SystemCgroups != "" {
if kcDecoded.SystemReservedCgroup != kcDecoded.SystemCgroups {
Member:

Why should both the values of SystemReservedCgroup and SystemCgroups match?
I don't find such a condition in the kubelet configuration docs.

Author:

As per https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

It is recommended that the OS system daemons are placed under a top level control group (system.slice on systemd machines for example).

If they are not the same, enforcement would happen on a different cgroup, while the values would be calculated using SystemCgroups.

Member:

Apologies, I'm still unclear on this.

Author:

I did some more digging into this. If they were different, the kubelet would move system processes into one cgroup (via SystemCgroups) but enforce the resource reservation on an empty or different cgroup (via SystemReservedCgroup), making the weights useless for those processes.

@ngopalak-redhat

@haircommander Please review

@ngopalak-redhat marked this pull request as draft on November 20, 2025 15:11
openshift-ci bot added the do-not-merge/work-in-progress label on Nov 20, 2025
@ngopalak-redhat changed the title from "OCPNODE-3201: Default Enablement of system-reserved-compressible in OpenShift 4.21" to "OCPNODE-3201: Default Enablement of system-reserved-compressible in OpenShift" on Nov 25, 2025
openshift-ci-robot commented Nov 25, 2025

@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

@ngopalak-redhat

/hold until OCP 4.22

openshift-ci bot added the do-not-merge/hold label on Nov 25, 2025
@ngopalak-redhat marked this pull request as ready for review on November 25, 2025 00:40
openshift-ci bot removed the do-not-merge/work-in-progress label on Nov 25, 2025
@ngopalak-redhat

cc: @harche

@ngopalak-redhat

Keeping in draft state to add an e2e test

openshift-merge-robot removed the needs-rebase label on Dec 17, 2025
@ngopalak-redhat

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

openshift-ci bot commented Dec 17, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5f3903c0-db3f-11f0-9304-6254ced58bff-0

@ngopalak-redhat

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

openshift-ci bot commented Dec 18, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/de46c180-dbe8-11f0-9b16-f4843203486f-0

@ngopalak-redhat

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

openshift-ci bot commented Dec 29, 2025

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@ngopalak-redhat

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2,openshift/origin#30644

openshift-ci bot commented Dec 29, 2025

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@ngopalak-redhat

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

openshift-ci bot commented Dec 29, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c86ee1f0-e497-11f0-8a0b-ff5ebaa0ea08-0

@ngopalak-redhat

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

openshift-ci bot commented Jan 2, 2026

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@ngopalak-redhat

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 openshift/origin#30644

openshift-ci bot commented Jan 2, 2026

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/121ace70-e797-11f0-9238-0cc470394ced-0

@ngopalak-redhat

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

openshift-ci bot commented Jan 2, 2026

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5fc5bd50-e7ca-11f0-9c72-75e7f7ff3b1b-0

@ngopalak-redhat

/test all

@ngopalak-redhat force-pushed the ngopalak/system-reserved-compressible-1 branch from ab9b00b to 308744e on January 2, 2026 11:05
@ngopalak-redhat

/test all

@ngopalak-redhat force-pushed the ngopalak/system-reserved-compressible-1 branch from 308744e to 6abd4be on January 5, 2026 01:56
@ngopalak-redhat

/test all

openshift-ci-robot commented Jan 5, 2026

@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

@ngopalak-redhat marked this pull request as ready for review on January 5, 2026 06:17
openshift-ci bot removed the do-not-merge/work-in-progress label on Jan 5, 2026
openshift-ci bot requested a review from umohnani8 on January 5, 2026 06:17
openshift-ci bot commented Jan 5, 2026

@ngopalak-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                Commit   Details  Required  Rerun command
ci/prow/okd-scos-images  6abd4be  link     true      /test okd-scos-images
ci/prow/bootstrap-unit   6abd4be  link     false     /test bootstrap-unit


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
