Skip to content

Conversation

@xing-yang
Copy link
Contributor

@xing-yang xing-yang commented Nov 25, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
The external-provisioner should continue to retry provisioning the PVC by calling the CSI driver even if VolumeSnapshot is being deleted. CSI driver should be idempotent and it will handle retries.

Which issue(s) this PR fixes:
Fixes #1449

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Allow provisioning to proceed when snapshot is being deleted to prevent leaking volumes and snapshots.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Nov 25, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 25, 2025
@xing-yang xing-yang changed the title WIP: Allow provisioning to proceed to prevent leaking resources Allow provisioning to proceed to prevent leaking resources Nov 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2025
@xing-yang
Copy link
Contributor Author

/assign @jsafrane

@jsafrane
Copy link
Contributor

I can see it makes sense to continue provisioning when it started with a valid snapshot. But does it make sense to start it with a snapshot that is being deleted? Do we have a good way how to distinguish those two?

For volume cloning, the provisioner adds a finalizer to PVC that is being cloned. Should it add a similar finalizer to VolumeSnapshot that is being restored? We could then easily see if provisioning started before a VolumeSnapshot got DeletionTimestamp:

  • If the finalizer exists, it must have been added before the snapshot got DeletionTimestamp (the API server checks that) and therefore the provisioning started and the provisioner should continue. Once it succeeds, remove the finalizer.
  • If the finalizer does not exist, try to add it. The API server will refuse to add a new finalizer to an object with DeletionTimestamp. Provisioning did not start, reject it with an error.

@xing-yang
Copy link
Contributor Author

That's a good point. We do have a finalizer added to VolumeSnapshot being used as a source to create PVC. I can add a check for that to differentiate an already started provisioning from a new provisioning case.

Check for finalizers on VolumeSnapshot to differentiate between
provisioning that has already started vs new provisioning attempts.

When a VolumeSnapshot has DeletionTimestamp set:
- If it has finalizers: provisioning was started before deletion,
  continue to prevent resource leaks. The external-snapshotter adds
  finalizers when a snapshot is used as a data source.
- If it has no finalizers: this is a new provisioning attempt,
  reject with an error.

This ensures the CSI driver can complete in-flight provisioning
operations and properly clean up resources, while preventing new
provisioning from snapshots that are being deleted.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 1, 2025
snapshotNotBound = "snapshot %s not bound"

pvcCloneFinalizer = "provisioner.storage.kubernetes.io/cloning-protection"
pvcCloneFinalizer = "provisioner.storage.kubernetes.io/cloning-protection"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we discuss the approach between finalizer and in memory cache or persistent storage?

Copy link
Contributor Author

@xing-yang xing-yang Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

External-provisioner already keeps retrying PVC provisioning. See logs below.

I1125 17:27:23.655056 1 controller.go:1082] Temporary error received, adding PVC 2844879f-0f2f-4d6d-bf0d-e02ae676b8e7 to claims in progress”.

Here the CSI driver returns a non-final error, the PVC is added to “Claims in Progress” to be retried even if PVC itself is deleted.

This check for finalizer is just to differentiate between a new provisioning request vs a provisioning retry. If it is a new provisioning request (the finalizer is not added yet), the external provisioner will fail the request and won't call the CSI driver. If it is a retry (finalizer is already added), the external provisioner will call CSI driver to retry. This allows the CSI driver to complete the volume provisioning and return success. After that, the volume will be deleted if the deletionTimestamp was added to PVC earlier.

@sunnylovestiramisu
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 2, 2025
@k8s-ci-robot k8s-ci-robot merged commit f1bc082 into kubernetes-csi:master Dec 2, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Leaking resources when pending PVC is deleted while being created from snapshot that is also being deleted

4 participants