Skip to content

Conversation

@jparrill
Copy link
Contributor

@jparrill jparrill commented Dec 19, 2025

Summary

This PR improves conflict error handling in the HostedCluster controller by properly handling API resource version conflicts during status updates.

This is part of the resolution of OCPBUGS-69394, still need more info about the behavior after this PR on Integration. The reasoning behind this is:

  • The Schedule object triggers an OADP backup
  • The backup triggers the pause
  • HO tries to set the status of other changes meanwhile this Pause is triggered also
  • The controllers tries to update the HC ending on multiples conflicts and triggering the rate limiter multiple times
  • It ends on a long wait until the Pause is propagated to the HCP.

This PR solves the status changes from HO to ensure the error handling is well managed.

Problem

The HostedCluster controller was experiencing reconciliation failures when concurrent status updates occurred, leading to resource version conflicts. These conflicts were being treated as hard errors rather than temporary conditions that should trigger requeue.

Solution

  • Add IsConflict() checks for all Status().Update() calls in the hosted cluster controller
  • Return appropriate requeue results when conflicts occur instead of propagating errors
  • Maintain proper error handling for non-conflict status update failures

Changes

  • File: hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • Type: Bug fix for status update conflict handling
  • Impact: Reduces controller reconciliation errors due to concurrent status updates

Testing

  • Changes follow existing error handling patterns
  • Conflict scenarios now return requeue instead of error
  • Non-conflict errors are still properly propagated

Fixes

Related Issues

  • Addresses status update conflicts in hosted cluster controller reconciliation
  • Improves controller stability under concurrent operations

🤖 Generated with Claude Code

Add proper conflict resolution for Status().Update() calls throughout
the hosted cluster controller. When API conflicts occur, return requeue
instead of propagating errors to prevent unnecessary error states.

This resolves issues where concurrent status updates could cause
controller reconciliation failures due to resource version conflicts.

Changes:
- Add IsConflict() checks in hosted cluster status updates
- Return appropriate requeue results for conflict scenarios
- Maintain error handling for non-conflict status update failures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 19, 2025
@openshift-ci-robot
Copy link

@jparrill: This pull request references Jira Issue OCPBUGS-69394, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

This PR improves conflict error handling in the HostedCluster controller by properly handling API resource version conflicts during status updates.

Problem

The HostedCluster controller was experiencing reconciliation failures when concurrent status updates occurred, leading to resource version conflicts. These conflicts were being treated as hard errors rather than temporary conditions that should trigger requeue.

Solution

  • Add IsConflict() checks for all Status().Update() calls in the hosted cluster controller
  • Return appropriate requeue results when conflicts occur instead of propagating errors
  • Maintain proper error handling for non-conflict status update failures

Changes

  • File: hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • Type: Bug fix for status update conflict handling
  • Impact: Reduces controller reconciliation errors due to concurrent status updates

Testing

  • Changes follow existing error handling patterns
  • Conflict scenarios now return requeue instead of error
  • Non-conflict errors are still properly propagated

Fixes

Related Issues

  • Addresses status update conflicts in hosted cluster controller reconciliation
  • Improves controller stability under concurrent operations

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 19, 2025

Walkthrough

Modified the hosted cluster controller to implement robust conflict handling across multiple reconciliation paths. When Kubernetes API updates encounter conflicts due to optimistic concurrency, the controller now requeues operations instead of failing, replacing direct error returns with requeue responses throughout status updates and finalizer removal operations.

Changes

Cohort / File(s) Summary
Conflict-aware reconciliation logic
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
Added conflict-aware error handling to multiple reconciliation paths. Status updates, platform state persistence, and finalizer removals now requeue on conflicts rather than propagate errors. Modified defaultClusterIDsIfNeeded to return nil on conflicts, allowing automatic requeue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify that all r.Client.Status().Update() and r.Update() calls properly handle conflicts by returning ctrl.Result{Requeue: true}
  • Ensure defaultClusterIDsIfNeeded conflict handling consistently returns nil without suppressing other error types
  • Validate that the requeue pattern is applied uniformly across AWS/Azure platform state persistence and finalizer removal branches
  • Confirm no legitimate errors are inadvertently masked by overly broad conflict handling
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between ff985bc and 6e4048f.

📒 Files selected for processing (1)
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (15 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
🔇 Additional comments (5)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (5)

373-378: LGTM! Proper conflict handling for status updates.

The implementation correctly distinguishes between transient conflicts (which trigger requeue) and genuine errors (which are aggregated and propagated). This pattern prevents reconciliation failures from resource version conflicts during concurrent status updates.


424-426: LGTM! Consistent conflict handling across all status updates.

Excellent consistency in applying the conflict handling pattern across all status update operations including:

  • AWS identity provider validation
  • Security group deletion states
  • Cloud resource destruction
  • Grace period handling
  • Platform credentials status
  • Backup restoration status
  • OIDC configuration

This comprehensive approach ensures the controller handles concurrent operations gracefully throughout the reconciliation lifecycle.

Also applies to: 454-456, 523-525, 598-600, 1372-1374, 1388-1390, 1438-1440, 2031-2033, 2046-2048


618-620: LGTM! Appropriate conflict handling for finalizer removal.

Correctly applies conflict handling to the finalizer removal operation. Since finalizer updates modify the object spec, they can encounter resource version conflicts and benefit from the requeue pattern.


1793-1795: LGTM! Conflict handling extends appropriately to resource reconciliation.

Good practice extending conflict handling to createOrUpdate operations for CAPI cluster and Azure SecretProviderClass resources. The createOrUpdate wrapper uses controller-runtime's CreateOrUpdate which can return conflicts during updates, so this handling improves stability during concurrent resource modifications.

Also applies to: 2059-2061, 2071-2073


4453-4455: LGTM! Correct conflict handling for error-returning helper function.

The pattern here differs appropriately from status update handlers:

  • defaultClusterIDsIfNeeded returns error type, not ctrl.Result
  • Returning nil on conflict is correct - it avoids treating transient conflicts as hard errors
  • The comment accurately describes the behavior: conflicts indicate concurrent modifications that will trigger automatic reconciliation via watch events

This is the proper pattern for helper functions that cannot directly control reconciliation flow.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jparrill
Copy link
Contributor Author

/auto-cc

@jparrill jparrill marked this pull request as draft December 19, 2025 09:25
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 19, 2025
@jparrill
Copy link
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 19, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@openshift-ci openshift-ci bot requested review from bryan-cox and csrwng December 19, 2025 09:26
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 19, 2025
@openshift-ci openshift-ci bot requested review from devguyio and sjenning December 19, 2025 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants