Bug Report
What did you do?
Created an Ansible Operator that creates Jobs based on custom CRs. When a CR appears, the playbook creates a Job (via the Ansible k8s module) and deletes the CR once the Job's completion is detected.
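Roughly, the playbook does something like the following (simplified sketch; task names, the container image, and the Job spec are placeholders, not my real workload; ansible_operator_meta is the variable the Ansible operator injects for the triggering CR):

---
- hosts: localhost
  gather_facts: false
  tasks:
    # Create a Job named after the triggering CR (placeholder spec)
    - name: Create a Job for the triggering Optimization CR
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: "{{ ansible_operator_meta.name }}"
            namespace: "{{ ansible_operator_meta.namespace }}"
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: worker
                    image: busybox   # placeholder image, not the real workload
                    command: ["sh", "-c", "echo run optimization"]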
What did you expect to see?
When a batch of CRs is detected and processed, the natural expectation is that the Job created during the playbook run for a particular CR carries an owner reference to exactly that CR.
What did you see instead? Under which circumstances?
The owner references are assigned incorrectly about 50% of the time when multiple CRs are created within a short time frame. They appear to point randomly to one of the CRs created in the same batch. This looks like a severe issue (unless I am doing something completely wrong?). I did not find any existing issue about this when searching.
Below is an example of two Jobs created from two watched CRs, which themselves were created within 3 seconds of each other. Note that each Job's name is set to the name of the CR it was created for (the name is a UID). While the actual Job data is correctly derived from its CR, the owner references are clearly switched here: the first Job has the second CR as its owner, and the second Job has the first CR as its owner:
Job 1:

apiVersion: batch/v1
kind: Job
metadata:
  name: 439964c7-6941-43b2-b2ff-3a8676eca868-20250502173039
  namespace: tenant-d4af8bbf-dfa2-41d2-a91a-1f4092f0222a
  uid: 6242278d-9c18-4a1a-8655-4e3a360c9904
  resourceVersion: '24780744'
  generation: 1
  creationTimestamp: '2025-02-06T16:57:50Z'
  labels:
    optimization_id: 439964c7-6941-43b2-b2ff-3a8676eca868
    optimization_instance_id: 439964c7-6941-43b2-b2ff-3a8676eca868-20250502173039
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
  ownerReferences:
    - apiVersion: abc.xyz.com/v1alpha1
      kind: Optimization
      name: bf1178cf-5788-43a4-98fe-c422705a037c-20250502173042
      uid: 3cfd7c28-098d-4553-83e4-140b37f73977
...
Job 2:

apiVersion: batch/v1
kind: Job
metadata:
  name: bf1178cf-5788-43a4-98fe-c422705a037c-20250502173042
  namespace: tenant-d4af8bbf-dfa2-41d2-a91a-1f4092f0222a
  uid: a1a93678-9f55-4032-8dbd-f5fa5bdc0be0
  resourceVersion: '24780937'
  generation: 1
  creationTimestamp: '2025-02-06T16:57:44Z'
  labels:
    optimization_id: bf1178cf-5788-43a4-98fe-c422705a037c
    optimization_instance_id: bf1178cf-5788-43a4-98fe-c422705a037c-20250502173042
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
  ownerReferences:
    - apiVersion: abc.xyz.com/v1alpha1
      kind: Optimization
      name: 439964c7-6941-43b2-b2ff-3a8676eca868-20250502173039
      uid: 86e8ca4f-c286-4f81-a0e3-5be500ea9deb
...
I observed the assignment of Job ownership to CRs to be any of the following:
- the assignments may be switched around, as above
- all Jobs in a batch (three, in one test) may be marked as owned by the same CR
- the ownership may be assigned correctly
Across several tests, Job ownership assignment appears to be non-deterministic for CRs created within a short time frame (< 5 seconds).
Environment
Kubernetes cluster type:
DigitalOcean DOKS with k8s 1.31
$ operator-sdk version
quay.io/operator-framework/ansible-operator:v1.37.1
$ kubectl version
1.31
Possible Solution
It seems as if there is no reliable back reference from the playbook being executed to the CR that triggered it? When the Job is created, it apparently gets assigned whichever owner is currently "active" in another thread, or something similar.
A possible workaround may be to assign the owner references manually in the playbook, assuming that with watchDependentResources: false no owner references are injected automatically.
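The workaround I have in mind would look roughly like this (simplified sketch; the k8s_info lookup and the placeholder Job spec are my assumptions of how it could be done, not tested yet):

- name: Look up the triggering Optimization CR to obtain its uid
  kubernetes.core.k8s_info:
    api_version: abc.xyz.com/v1alpha1
    kind: Optimization
    name: "{{ ansible_operator_meta.name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: owner_cr

- name: Create the Job with an explicit owner reference to exactly that CR
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: "{{ ansible_operator_meta.name }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
        ownerReferences:
          - apiVersion: abc.xyz.com/v1alpha1
            kind: Optimization
            name: "{{ owner_cr.resources[0].metadata.name }}"
            uid: "{{ owner_cr.resources[0].metadata.uid }}"
            controller: true
            blockOwnerDeletion: true
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: worker
                image: busybox   # placeholder, not the real workload image
                command: ["sh", "-c", "echo run optimization"]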
Additional context
watches.yml:

---
- version: v1alpha1
  group: abc.xyz.com
  kind: Optimization
  playbook: /opt/ansible/playbook.yml
  reconcilePeriod: "10s"
  watchDependentResources: true
  manageStatus: true
Thanks for looking into this; I feel this is quite a critical bug.