
Conversation


@tudor-timcu commented Oct 22, 2025

CHANGE DESCRIPTION

Problem:
During physical restore operations with PITR enabled, the reconciliation loop could get stuck repeatedly attempting to prepare replsets, because it did not detect whether preparation had already started or completed. This led to unnecessary repeated preparation calls, made without verifying within the same reconcile cycle whether preparation had succeeded, and could stall restore operations indefinitely.

Cause:
The reconcile logic had no state tracking for the physical restore preparation phase. When replsets were not ready, the controller called prepareReplsetsForPhysicalRestore() and returned immediately, without rechecking whether the preparation had completed successfully. There was also no mechanism to prevent duplicate preparation attempts across concurrent reconcile loops or to detect stuck preparation operations, so the restore process could loop without making progress.
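A toy model of the pre-change behaviour described above (function names here are illustrative, not the operator's actual API) shows why successive reconciles could keep repeating the prepare call:

```go
package main

import "fmt"

// reconcileOnce mimics the old flow: when replsets are not ready, prepare is
// called and the reconcile returns immediately, without recording that a
// prepare was started and without re-checking readiness afterwards.
func reconcileOnce(replsetsReady bool, prepare func()) bool {
	if !replsetsReady {
		prepare()
		return false
	}
	return true
}

func main() {
	calls := 0
	prepare := func() { calls++ }

	// If readiness never flips during these cycles, every reconcile repeats
	// the prepare call and the restore makes no progress.
	for i := 0; i < 5; i++ {
		reconcileOnce(false, prepare)
	}
	fmt.Printf("prepare called %d times without progress\n", calls)
}
```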

Solution:
Introduced an annotation-based tracking mechanism (operator.percona.com/physical-prepare-started) to mark when the prepare phase begins and ends. The solution implements:

  • Skipping redundant prepare operations when a prepare is already in progress (marker less than 10 minutes old)
  • Auto-expiring stuck prepare markers after 10 minutes to allow a retry
  • Re-checking replset readiness after preparation so the restore can proceed in the same reconcile cycle if preparation succeeds
  • Clearing the prepare marker deterministically once readiness turns true

This ensures the reconcile loop makes forward progress and avoids infinite loops while remaining idempotent; a minimal sketch of the marker logic is shown below.
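A minimal sketch of that marker logic, assuming the start time is stored as an RFC 3339 timestamp in the restore object's annotations (the annotation key and 10-minute expiry come from the description above; the helper names are illustrative, not the PR's actual functions):

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key and timeout as described in the PR; names below are illustrative.
const (
	prepareStartedAnnotation = "operator.percona.com/physical-prepare-started"
	prepareTimeout           = 10 * time.Minute
)

// shouldRunPrepare decides whether prepare should be called in this reconcile
// cycle, based on the marker annotation.
func shouldRunPrepare(annotations map[string]string, now time.Time) bool {
	started, ok := annotations[prepareStartedAnnotation]
	if !ok {
		return true // no prepare in flight yet
	}
	t, err := time.Parse(time.RFC3339, started)
	if err != nil {
		return true // unreadable marker: treat as stale and retry
	}
	return now.Sub(t) > prepareTimeout // expired marker means a stuck prepare: retry
}

func main() {
	ann := map[string]string{}
	now := time.Now()

	// First reconcile: no marker yet, so prepare runs and the marker is set.
	if shouldRunPrepare(ann, now) {
		ann[prepareStartedAnnotation] = now.Format(time.RFC3339)
		fmt.Println("prepare started")
	}

	// A reconcile a few minutes later sees the fresh marker and skips prepare.
	fmt.Println("skip redundant prepare:", !shouldRunPrepare(ann, now.Add(3*time.Minute)))

	// Once replsets report ready (re-checked after preparation), the marker is cleared.
	replsetsReady := true
	if replsetsReady {
		delete(ann, prepareStartedAnnotation)
	}

	// A marker left behind by a stuck prepare expires after 10 minutes, allowing retry.
	ann[prepareStartedAnnotation] = now.Format(time.RFC3339)
	fmt.Println("retry stuck prepare:", shouldRunPrepare(ann, now.Add(11*time.Minute)))
}
```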

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported MongoDB versions?
  • Does the change support the oldest and newest supported Kubernetes versions?


CLAassistant commented Oct 22, 2025

CLA assistant check
All committers have signed the CLA.

@hors added the community label Oct 23, 2025
@JNKPercona
Collaborator

Test Name Result Time
arbiter failure 00:00:00
balancer failure 00:00:00
cross-site-sharded failure 00:00:00
custom-replset-name failure 00:00:00
custom-tls failure 00:00:00
custom-users-roles failure 00:00:00
custom-users-roles-sharded failure 00:00:00
data-at-rest-encryption failure 00:00:00
data-sharded failure 00:00:00
demand-backup failure 00:00:00
demand-backup-eks-credentials-irsa skipped 00:00:00
demand-backup-fs skipped 00:00:00
demand-backup-if-unhealthy skipped 00:00:00
demand-backup-incremental skipped 00:00:00
demand-backup-incremental-sharded skipped 00:00:00
demand-backup-physical-parallel skipped 00:00:00
demand-backup-physical-aws skipped 00:00:00
demand-backup-physical-azure skipped 00:00:00
demand-backup-physical-gcp-s3 skipped 00:00:00
demand-backup-physical-gcp-native skipped 00:00:00
demand-backup-physical-minio skipped 00:00:00
demand-backup-physical-sharded-parallel skipped 00:00:00
demand-backup-physical-sharded-aws skipped 00:00:00
demand-backup-physical-sharded-azure skipped 00:00:00
demand-backup-physical-sharded-gcp-native skipped 00:00:00
demand-backup-physical-sharded-minio skipped 00:00:00
demand-backup-sharded skipped 00:00:00
expose-sharded skipped 00:00:00
finalizer skipped 00:00:00
ignore-labels-annotations skipped 00:00:00
init-deploy skipped 00:00:00
ldap skipped 00:00:00
ldap-tls skipped 00:00:00
limits skipped 00:00:00
liveness skipped 00:00:00
mongod-major-upgrade skipped 00:00:00
mongod-major-upgrade-sharded skipped 00:00:00
monitoring-2-0 skipped 00:00:00
monitoring-pmm3 skipped 00:00:00
multi-cluster-service skipped 00:00:00
multi-storage skipped 00:00:00
non-voting-and-hidden skipped 00:00:00
one-pod skipped 00:00:00
operator-self-healing-chaos skipped 00:00:00
pitr skipped 00:00:00
pitr-physical skipped 00:00:00
pitr-sharded skipped 00:00:00
pitr-to-new-cluster skipped 00:00:00
pitr-physical-backup-source skipped 00:00:00
preinit-updates skipped 00:00:00
pvc-resize skipped 00:00:00
recover-no-primary skipped 00:00:00
replset-overrides skipped 00:00:00
rs-shard-migration skipped 00:00:00
scaling skipped 00:00:00
scheduled-backup skipped 00:00:00
security-context skipped 00:00:00
self-healing-chaos skipped 00:00:00
service-per-pod skipped 00:00:00
serviceless-external-nodes skipped 00:00:00
smart-update skipped 00:00:00
split-horizon skipped 00:00:00
stable-resource-version skipped 00:00:00
storage skipped 00:00:00
tls-issue-cert-manager skipped 00:00:00
upgrade skipped 00:00:00
upgrade-consistency skipped 00:00:00
upgrade-consistency-sharded-tls skipped 00:00:00
upgrade-sharded skipped 00:00:00
upgrade-partial-backup skipped 00:00:00
users skipped 00:00:00
version-service skipped 00:00:00
Summary Value
Tests Run 10/72
Job Duration 01:49:08
Total Test Time N/A

commit: 6e37bb2
image: perconalab/percona-server-mongodb-operator:PR-2096-6e37bb22
