
Ansible Operator: owner references of created jobs not matching the actual owner CR frequently #136

@gre9ory

Description

Bug Report

What did you do?

Created an Ansible Operator that creates Jobs based on custom CRs. When a CR appears, the playbook creates a Job (via the Ansible k8s module) and deletes the CR once completion of the Job is detected.
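For context, a minimal sketch of what such a playbook might look like. The actual /opt/ansible/playbook.yml is not included in this report, so the task layout and the job_spec variable below are assumptions; name and namespace come from the ansible_operator_meta extra vars injected by the Ansible Operator:

---
- hosts: localhost
  gather_facts: false
  tasks:
    # Create (or re-apply) the Job for the CR that triggered this run.
    - name: Create Job for this CR
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: "{{ ansible_operator_meta.name }}"
            namespace: "{{ ansible_operator_meta.namespace }}"
          spec: "{{ job_spec }}"  # placeholder for the real Job spec

    # Read the Job back; the 10s reconcilePeriod re-runs the playbook,
    # so completion only needs to be checked, not waited for.
    - name: Read back the Job
      kubernetes.core.k8s_info:
        api_version: batch/v1
        kind: Job
        name: "{{ ansible_operator_meta.name }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
      register: job_info

    # Delete the CR once the Job reports at least one succeeded pod.
    - name: Delete the CR after job completion
      kubernetes.core.k8s:
        state: absent
        api_version: abc.xyz.com/v1alpha1
        kind: Optimization
        name: "{{ ansible_operator_meta.name }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
      when:
        - job_info.resources | length > 0
        - (job_info.resources[0].status.succeeded | default(0)) > 0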

What did you expect to see?

When a batch of CRs is detected and processed, it is (naturally) expected that a Job created during the playbook run for a particular CR has an owner reference to exactly that CR, i.e. the one for which the playbook was started.

What did you see instead? Under which circumstances?

The owner references are assigned incorrectly roughly 50% of the time when multiple CRs are created within a short time frame; they appear to point to a random one of the CRs created in the batch. This looks like a severe issue (unless I am doing something totally wrong?). I did not find any existing issue related to this when searching.

Example of two Jobs created from two watched CRs; the CRs were created within 3 seconds of each other. Note that each Job's name is set equal to the name of the CR it was created for (the name is a UID). While the actual Job data is correctly derived from the CR, the owner references are clearly switched here: the first Job has the second CR as its owner, while the second Job has the first CR as its owner:

Job 1:

apiVersion: batch/v1
kind: Job
metadata:
  name: 439964c7-6941-43b2-b2ff-3a8676eca868-20250502173039
  namespace: tenant-d4af8bbf-dfa2-41d2-a91a-1f4092f0222a
  uid: 6242278d-9c18-4a1a-8655-4e3a360c9904
  resourceVersion: '24780744'
  generation: 1
  creationTimestamp: '2025-02-06T16:57:50Z'
  labels:
    optimization_id: 439964c7-6941-43b2-b2ff-3a8676eca868
    optimization_instance_id: 439964c7-6941-43b2-b2ff-3a8676eca868-20250502173039
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
  ownerReferences:
    - apiVersion: abc.xyz.com/v1alpha1
      kind: Optimization
      name: bf1178cf-5788-43a4-98fe-c422705a037c-20250502173042
      uid: 3cfd7c28-098d-4553-83e4-140b37f73977
...

Job 2:

apiVersion: batch/v1
kind: Job
metadata:
  name: bf1178cf-5788-43a4-98fe-c422705a037c-20250502173042
  namespace: tenant-d4af8bbf-dfa2-41d2-a91a-1f4092f0222a
  uid: a1a93678-9f55-4032-8dbd-f5fa5bdc0be0
  resourceVersion: '24780937'
  generation: 1
  creationTimestamp: '2025-02-06T16:57:44Z'
  labels:
    optimization_id: bf1178cf-5788-43a4-98fe-c422705a037c
    optimization_instance_id: bf1178cf-5788-43a4-98fe-c422705a037c-20250502173042
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
  ownerReferences:
    - apiVersion: abc.xyz.com/v1alpha1
      kind: Optimization
      name: 439964c7-6941-43b2-b2ff-3a8676eca868-20250502173039
      uid: 86e8ca4f-c286-4f81-a0e3-5be500ea9deb
...

I observed the assignment of Job ownership to CRs to be any of the following:

  • the assignments may be switched around like above
  • all three jobs may be marked as owned by one CR
  • the ownership may be correctly assigned

Across several tests, Job ownership assignment appears to be nondeterministic for CRs created within a short time frame (< 5 seconds).

Environment

Kubernetes cluster type:

DigitalOcean DOKS with k8s 1.31

$ operator-sdk version

quay.io/operator-framework/ansible-operator:v1.37.1

$ kubectl version

1.31

Possible Solution

It seems as if there is no clear back reference from the playbook being executed to the CR that triggered it. When the Job is created, it seems to get assigned whichever owner happens to be "active" in another thread at that moment, or something similar?

A possible workaround may be to assign the owner references manually in the playbook, assuming that with watchDependentResources: false no owner references are injected automatically.
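A minimal sketch of such a manual assignment, assuming the kubernetes.core collection and the ansible_operator_meta extra vars; the CR is looked up explicitly so that the UID placed in the owner reference is guaranteed to belong to the CR that triggered this run (job_spec is a placeholder):

# Fetch the triggering CR to obtain its UID.
- name: Look up the triggering Optimization CR
  kubernetes.core.k8s_info:
    api_version: abc.xyz.com/v1alpha1
    kind: Optimization
    name: "{{ ansible_operator_meta.name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: owner_cr

# Create the Job with an ownerReference pointing explicitly at that CR,
# instead of relying on the operator's automatic injection.
- name: Create Job owned by the triggering CR
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: "{{ ansible_operator_meta.name }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
        ownerReferences:
          - apiVersion: abc.xyz.com/v1alpha1
            kind: Optimization
            name: "{{ owner_cr.resources[0].metadata.name }}"
            uid: "{{ owner_cr.resources[0].metadata.uid }}"
            controller: true
      spec: "{{ job_spec }}"  # placeholder for the real Job spec

Whether watchDependentResources: false actually disables the automatic injection is part of the assumption above; if it does not, the manually set reference might still be overwritten or duplicated.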

Additional context

watches.yml:

---
- version: v1alpha1
  group: abc.xyz.com
  kind: Optimization
  playbook: /opt/ansible/playbook.yml
  reconcilePeriod: "10s"
  watchDependentResources: true
  manageStatus: true

Thanks for looking into this; I feel this is quite a critical bug.
