
Conversation


@tudor-timcu commented Oct 22, 2025

CHANGE DESCRIPTION

Problem:
During physical restore operations with PITR enabled, the reconciliation loop could get stuck repeatedly attempting to prepare replsets, because it did not detect whether preparation had already started or completed. This led to unnecessary repeated preparation calls, made without verifying within the same reconcile cycle whether preparation had succeeded, and could stall restore operations indefinitely.

Cause:
The reconcile logic had no state tracking for the physical restore preparation phase. When replsets were not ready, the controller called prepareReplsetsForPhysicalRestore() and returned immediately, without rechecking whether the preparation had completed successfully. There was also no mechanism to prevent duplicate preparation attempts across concurrent reconcile loops or to detect stuck preparation operations, so the restore process could loop without making progress.
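A toy model of the pre-change behaviour described above (function names here are illustrative, not the operator's actual API) shows why successive reconciles could keep repeating the prepare call:

```go
package main

import "fmt"

// reconcileOnce mimics the old flow: when replsets are not ready, prepare is
// called and the reconcile returns immediately, without recording that a
// prepare was started and without re-checking readiness afterwards.
func reconcileOnce(replsetsReady bool, prepare func()) bool {
	if !replsetsReady {
		prepare()
		return false
	}
	return true
}

func main() {
	calls := 0
	prepare := func() { calls++ }

	// If readiness never flips during these cycles, every reconcile repeats
	// the prepare call and the restore makes no progress.
	for i := 0; i < 5; i++ {
		reconcileOnce(false, prepare)
	}
	fmt.Printf("prepare called %d times without progress\n", calls)
}
```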

Solution:
Introduced an annotation-based tracking mechanism (operator.percona.com/physical-prepare-started) to mark when the prepare phase begins and ends. The solution implements:

  • Skipping redundant prepare operations when a prepare is already in progress (marker less than 10 minutes old)
  • Auto-expiring stuck prepare markers after 10 minutes to allow a retry
  • Re-checking replset readiness after preparation so the restore can proceed in the same reconcile cycle if preparation succeeds
  • Clearing the prepare marker deterministically once readiness turns true

This ensures the reconcile loop makes forward progress and avoids infinite loops while remaining idempotent; a minimal sketch of the marker logic is shown below.
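A minimal sketch of that marker logic, assuming the start time is stored as an RFC 3339 timestamp in the restore object's annotations (the annotation key and 10-minute expiry come from the description above; the helper names are illustrative, not the PR's actual functions):

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key and timeout as described in the PR; names below are illustrative.
const (
	prepareStartedAnnotation = "operator.percona.com/physical-prepare-started"
	prepareTimeout           = 10 * time.Minute
)

// shouldRunPrepare decides whether prepare should be called in this reconcile
// cycle, based on the marker annotation.
func shouldRunPrepare(annotations map[string]string, now time.Time) bool {
	started, ok := annotations[prepareStartedAnnotation]
	if !ok {
		return true // no prepare in flight yet
	}
	t, err := time.Parse(time.RFC3339, started)
	if err != nil {
		return true // unreadable marker: treat as stale and retry
	}
	return now.Sub(t) > prepareTimeout // expired marker means a stuck prepare: retry
}

func main() {
	ann := map[string]string{}
	now := time.Now()

	// First reconcile: no marker yet, so prepare runs and the marker is set.
	if shouldRunPrepare(ann, now) {
		ann[prepareStartedAnnotation] = now.Format(time.RFC3339)
		fmt.Println("prepare started")
	}

	// A reconcile a few minutes later sees the fresh marker and skips prepare.
	fmt.Println("skip redundant prepare:", !shouldRunPrepare(ann, now.Add(3*time.Minute)))

	// Once replsets report ready (re-checked after preparation), the marker is cleared.
	replsetsReady := true
	if replsetsReady {
		delete(ann, prepareStartedAnnotation)
	}

	// A marker left behind by a stuck prepare expires after 10 minutes, allowing retry.
	ann[prepareStartedAnnotation] = now.Format(time.RFC3339)
	fmt.Println("retry stuck prepare:", shouldRunPrepare(ann, now.Add(11*time.Minute)))
}
```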

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported MongoDB versions?
  • Does the change support the oldest and newest supported Kubernetes versions?


CLAassistant commented Oct 22, 2025

CLA assistant check
All committers have signed the CLA.

@hors added the community label Oct 23, 2025
@JNKPercona
Collaborator

Test Name Result Time
arbiter failure 00:00:00
balancer failure 00:00:00
cross-site-sharded failure 00:00:00
custom-replset-name failure 00:00:00
custom-tls failure 00:00:00
custom-users-roles failure 00:00:00
custom-users-roles-sharded failure 00:00:00
data-at-rest-encryption failure 00:00:00
data-sharded failure 00:00:00
demand-backup failure 00:00:00
demand-backup-eks-credentials-irsa skipped 00:00:00
demand-backup-fs skipped 00:00:00
demand-backup-if-unhealthy skipped 00:00:00
demand-backup-incremental skipped 00:00:00
demand-backup-incremental-sharded skipped 00:00:00
demand-backup-physical-parallel skipped 00:00:00
demand-backup-physical-aws skipped 00:00:00
demand-backup-physical-azure skipped 00:00:00
demand-backup-physical-gcp-s3 skipped 00:00:00
demand-backup-physical-gcp-native skipped 00:00:00
demand-backup-physical-minio skipped 00:00:00
demand-backup-physical-sharded-parallel skipped 00:00:00
demand-backup-physical-sharded-aws skipped 00:00:00
demand-backup-physical-sharded-azure skipped 00:00:00
demand-backup-physical-sharded-gcp-native skipped 00:00:00
demand-backup-physical-sharded-minio skipped 00:00:00
demand-backup-sharded skipped 00:00:00
expose-sharded skipped 00:00:00
finalizer skipped 00:00:00
ignore-labels-annotations skipped 00:00:00
init-deploy skipped 00:00:00
ldap skipped 00:00:00
ldap-tls skipped 00:00:00
limits skipped 00:00:00
liveness skipped 00:00:00
mongod-major-upgrade skipped 00:00:00
mongod-major-upgrade-sharded skipped 00:00:00
monitoring-2-0 skipped 00:00:00
monitoring-pmm3 skipped 00:00:00
multi-cluster-service skipped 00:00:00
multi-storage skipped 00:00:00
non-voting-and-hidden skipped 00:00:00
one-pod skipped 00:00:00
operator-self-healing-chaos skipped 00:00:00
pitr skipped 00:00:00
pitr-physical skipped 00:00:00
pitr-sharded skipped 00:00:00
pitr-to-new-cluster skipped 00:00:00
pitr-physical-backup-source skipped 00:00:00
preinit-updates skipped 00:00:00
pvc-resize skipped 00:00:00
recover-no-primary skipped 00:00:00
replset-overrides skipped 00:00:00
rs-shard-migration skipped 00:00:00
scaling skipped 00:00:00
scheduled-backup skipped 00:00:00
security-context skipped 00:00:00
self-healing-chaos skipped 00:00:00
service-per-pod skipped 00:00:00
serviceless-external-nodes skipped 00:00:00
smart-update skipped 00:00:00
split-horizon skipped 00:00:00
stable-resource-version skipped 00:00:00
storage skipped 00:00:00
tls-issue-cert-manager skipped 00:00:00
upgrade skipped 00:00:00
upgrade-consistency skipped 00:00:00
upgrade-consistency-sharded-tls skipped 00:00:00
upgrade-sharded skipped 00:00:00
upgrade-partial-backup skipped 00:00:00
users skipped 00:00:00
version-service skipped 00:00:00
Summary Value
Tests Run 10/72
Job Duration 01:49:08
Total Test Time N/A

commit: 6e37bb2
image: perconalab/percona-server-mongodb-operator:PR-2096-6e37bb22
