OCPBUGS-69394: improve conflict error handling in hosted cluster status updates #7414

jparrill · 2025-12-19T09:24:58Z

Summary

This PR improves conflict error handling in the HostedCluster controller by properly handling API resource version conflicts during status updates.

This is part of the resolution of OCPBUGS-69394, still need more info about the behavior after this PR on Integration. The reasoning behind this is:

The Schedule object triggers an OADP backup
The backup triggers the pause
HO tries to set the status of other changes meanwhile this Pause is triggered also
The controllers tries to update the HC ending on multiples conflicts and triggering the rate limiter multiple times
It ends on a long wait until the Pause is propagated to the HCP.

This PR solves the status changes from HO to ensure the error handling is well managed.

Problem

The HostedCluster controller was experiencing reconciliation failures when concurrent status updates occurred, leading to resource version conflicts. These conflicts were being treated as hard errors rather than temporary conditions that should trigger requeue.

Solution

Add IsConflict() checks for all Status().Update() calls in the hosted cluster controller
Return appropriate requeue results when conflicts occur instead of propagating errors
Maintain proper error handling for non-conflict status update failures

Changes

File: hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
Type: Bug fix for status update conflict handling
Impact: Reduces controller reconciliation errors due to concurrent status updates

Testing

Changes follow existing error handling patterns
Conflict scenarios now return requeue instead of error
Non-conflict errors are still properly propagated

Fixes

OCPBUGS-69394

Related Issues

Addresses status update conflicts in hosted cluster controller reconciliation
Improves controller stability under concurrent operations

🤖 Generated with Claude Code

Add proper conflict resolution for Status().Update() calls throughout the hosted cluster controller. When API conflicts occur, return requeue instead of propagating errors to prevent unnecessary error states. This resolves issues where concurrent status updates could cause controller reconciliation failures due to resource version conflicts. Changes: - Add IsConflict() checks in hosted cluster status updates - Return appropriate requeue results for conflict scenarios - Maintain error handling for non-conflict status update failures 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

openshift-ci-robot · 2025-12-19T09:25:04Z

@jparrill: This pull request references Jira Issue OCPBUGS-69394, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

This PR improves conflict error handling in the HostedCluster controller by properly handling API resource version conflicts during status updates.

Problem

The HostedCluster controller was experiencing reconciliation failures when concurrent status updates occurred, leading to resource version conflicts. These conflicts were being treated as hard errors rather than temporary conditions that should trigger requeue.

Solution

Add IsConflict() checks for all Status().Update() calls in the hosted cluster controller

Return appropriate requeue results when conflicts occur instead of propagating errors

Maintain proper error handling for non-conflict status update failures

Changes

File: hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go

Type: Bug fix for status update conflict handling

Impact: Reduces controller reconciliation errors due to concurrent status updates

Testing

Changes follow existing error handling patterns

Conflict scenarios now return requeue instead of error

Non-conflict errors are still properly propagated

Fixes

OCPBUGS-69394

Related Issues

Addresses status update conflicts in hosted cluster controller reconciliation

Improves controller stability under concurrent operations

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2025-12-19T09:25:14Z

Walkthrough

Modified the hosted cluster controller to implement robust conflict handling across multiple reconciliation paths. When Kubernetes API updates encounter conflicts due to optimistic concurrency, the controller now requeues operations instead of failing, replacing direct error returns with requeue responses throughout status updates and finalizer removal operations.

Changes

Cohort / File(s)	Summary
Conflict-aware reconciliation logic `hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`	Added conflict-aware error handling to multiple reconciliation paths. Status updates, platform state persistence, and finalizer removals now requeue on conflicts rather than propagate errors. Modified `defaultClusterIDsIfNeeded` to return nil on conflicts, allowing automatic requeue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Verify that all r.Client.Status().Update() and r.Update() calls properly handle conflicts by returning ctrl.Result{Requeue: true}
Ensure defaultClusterIDsIfNeeded conflict handling consistently returns nil without suppressing other error types
Validate that the requeue pattern is applied uniformly across AWS/Azure platform state persistence and finalizer removal branches
Confirm no legitimate errors are inadvertently masked by overly broad conflict handling

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between ff985bc and 6e4048f.

📒 Files selected for processing (1)

hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (15 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go

🔇 Additional comments (5)

hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (5)

373-378: LGTM! Proper conflict handling for status updates.

The implementation correctly distinguishes between transient conflicts (which trigger requeue) and genuine errors (which are aggregated and propagated). This pattern prevents reconciliation failures from resource version conflicts during concurrent status updates.

424-426: LGTM! Consistent conflict handling across all status updates.

Excellent consistency in applying the conflict handling pattern across all status update operations including:

AWS identity provider validation

Security group deletion states

Cloud resource destruction

Grace period handling

Platform credentials status

Backup restoration status

OIDC configuration

This comprehensive approach ensures the controller handles concurrent operations gracefully throughout the reconciliation lifecycle.

Also applies to: 454-456, 523-525, 598-600, 1372-1374, 1388-1390, 1438-1440, 2031-2033, 2046-2048

618-620: LGTM! Appropriate conflict handling for finalizer removal.

Correctly applies conflict handling to the finalizer removal operation. Since finalizer updates modify the object spec, they can encounter resource version conflicts and benefit from the requeue pattern.

1793-1795: LGTM! Conflict handling extends appropriately to resource reconciliation.

Good practice extending conflict handling to createOrUpdate operations for CAPI cluster and Azure SecretProviderClass resources. The createOrUpdate wrapper uses controller-runtime's CreateOrUpdate which can return conflicts during updates, so this handling improves stability during concurrent resource modifications.

Also applies to: 2059-2061, 2071-2073

4453-4455: LGTM! Correct conflict handling for error-returning helper function.

The pattern here differs appropriately from status update handlers:

defaultClusterIDsIfNeeded returns error type, not ctrl.Result

Returning nil on conflict is correct - it avoids treating transient conflicts as hard errors

The comment accurately describes the behavior: conflicts indicate concurrent modifications that will trigger automatic reconciliation via watch events

This is the proper pattern for helper functions that cannot directly control reconciliation flow.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

jparrill · 2025-12-19T09:25:27Z

/auto-cc

jparrill · 2025-12-19T09:26:26Z

@coderabbitai review

coderabbitai · 2025-12-19T09:26:32Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

openshift-ci · 2025-12-19T09:26:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jparrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 19, 2025

openshift-ci bot added the do-not-merge/needs-area label Dec 19, 2025

jparrill marked this pull request as draft December 19, 2025 09:25

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 19, 2025

openshift-ci bot requested review from bryan-cox and csrwng December 19, 2025 09:26

openshift-ci bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 19, 2025

openshift-ci bot requested review from devguyio and sjenning December 19, 2025 09:26

openshift-ci bot removed the do-not-merge/needs-area label Dec 19, 2025

mtulio mentioned this pull request Dec 19, 2025

OCPBUGS-65885: added 1.34 required ELBv2 perms to CCM role #7339

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OCPBUGS-69394: improve conflict error handling in hosted cluster status updates #7414

OCPBUGS-69394: improve conflict error handling in hosted cluster status updates #7414

Uh oh!

jparrill commented Dec 19, 2025 •

edited

Loading

Uh oh!

openshift-ci-robot commented Dec 19, 2025

Summary

Problem

Solution

Changes

Testing

Fixes

Related Issues

Uh oh!

coderabbitai bot commented Dec 19, 2025 •

edited

Loading

Uh oh!

jparrill commented Dec 19, 2025

Uh oh!

jparrill commented Dec 19, 2025

Uh oh!

coderabbitai bot commented Dec 19, 2025

Uh oh!

openshift-ci bot commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

OCPBUGS-69394: improve conflict error handling in hosted cluster status updates #7414

Are you sure you want to change the base?

OCPBUGS-69394: improve conflict error handling in hosted cluster status updates #7414

Uh oh!

Conversation

jparrill commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Changes

Testing

Fixes

Related Issues

Uh oh!

openshift-ci-robot commented Dec 19, 2025

Summary

Problem

Solution

Changes

Testing

Fixes

Related Issues

Uh oh!

coderabbitai bot commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

jparrill commented Dec 19, 2025

Uh oh!

jparrill commented Dec 19, 2025

Uh oh!

coderabbitai bot commented Dec 19, 2025

Uh oh!

openshift-ci bot commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jparrill commented Dec 19, 2025 •

edited

Loading

coderabbitai bot commented Dec 19, 2025 •

edited

Loading