BUG: Kafka Connect connector configurations fail to deploy in serial deployment mode #2160

@sami-airaksinen

Describe the issue

When deploying Kafka Connect with connectors (2 worker nodes) in OAuth mode against an HTTPS REST endpoint, the deployment fails because of missing certificates. The cause is that run_once and serial: 1 do not work together: the role tries to rerun the connector configuration updates on worker-0, which has already cleaned up its temporary certs because it is in a separate serial batch.

To be more precise, with our inventory the role takes the task branch Register connector configs and remove deleted connectors for Multiple Clusters, because it interprets the connect cluster's parent groups as subgroups and/or some dynamic groupings. This task uses delegate_to and therefore runs, in our case, on a worker whose certs have already been removed, although the update REST call expects them to be available.
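The interaction can be sketched with a minimal playbook that is independent of the role. This is an illustration only, not the actual kafka_connect tasks: the file path /tmp/demo-temp-cert and the task names are made up, and the only thing borrowed from our setup is the kafka_connect group name. It relies on the documented Ansible behaviour that a run_once task runs once per serial batch, not once per play.

```yaml
# Hypothetical reproduction sketch; path and task names are illustrative only.
- name: Serial batches combined with run_once and delegate_to
  hosts: kafka_connect
  serial: 1
  gather_facts: false
  tasks:
    - name: Create a temporary file on the host of the current batch
      ansible.builtin.copy:
        content: "temporary"
        dest: /tmp/demo-temp-cert
        mode: "0600"

    # run_once is evaluated once per serial batch, so this task runs in
    # every batch and is always delegated to the first worker.
    - name: Read the temporary file on the first worker
      ansible.builtin.command: cat /tmp/demo-temp-cert
      delegate_to: "{{ groups['kafka_connect'][0] }}"
      run_once: true
      changed_when: false

    - name: Remove the temporary file from the host of the current batch
      ansible.builtin.file:
        path: /tmp/demo-temp-cert
        state: absent
```

In the first batch the file exists on the first worker when it is read. In the second batch the read is delegated to the first worker again, where the file was already removed at the end of the first batch, so the task fails in the same way as the connector registration does in the logs below.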

To Reproduce

Steps to reproduce the behaviour:

  • we run ansible-playbook -i hosts.yml --extra-vars "@extra-vars.yaml" --limit kafka_connect confluent.platform.all

Expected behaviour

The kafka-connect role is expected to register and deploy connectors once per connect cluster in all deployment modes (parallel/serial).
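For reference, one general Ansible pattern for running a task exactly once per play regardless of serial batching is a conditional on the first targeted host instead of run_once. The snippet below is only a sketch under that assumption; the task name and the debug placeholder are hypothetical, and it does not account for the role's subgroup handling, so it is not necessarily how the role should be fixed.

```yaml
# Hypothetical sketch; the debug task stands in for the real registration call.
- name: Register connectors once for the whole play
  ansible.builtin.debug:
    msg: "connector registration would happen here"
  # ansible_play_hosts_all lists every host targeted by the play, so only
  # the first of them matches, independent of the serial batch size.
  when: inventory_hostname == ansible_play_hosts_all[0]
```

With serial: 1 this runs in the batch containing the first targeted host and is skipped in every later batch, while that host's temporary certs are still in place. A known limitation is that the task never runs if that first host is unreachable or has already failed.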

Inventory File

Here are the relevant bits (i.e. the structure):

all:
  hosts:
    bastion:
      ...
  children:
    behind_bastion:
      vars:
        ....
      children:
        core:
          children:
            ...
            kafka_connect:
              hosts:
                kafka-connect-0:
                  ansible_host: "{{ IP_connect0 }}"
                kafka-connect-1:
                  ansible_host: "{{ IP_connect1 }}"
            ...

Logs

Relevant bits of the logs:

...
# successful first node deployment of configs on worker-0

TASK [confluent.platform.common : Get Authorization Token] *********************
ok: [kafka-connect-0]

TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for single cluster] ***
skipping: [kafka-connect-0]

TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for Multiple Clusters] ***
skipping: [kafka-connect-0] => (item=behind_bastion) 
changed: [kafka-connect-0] => (item=kafka_connect)
changed: [kafka-connect-0] => (item=kafka_connect_serial)
skipping: [kafka-connect-0] => (item=core) 

TASK [confluent.platform.kafka_connect : Delete temporary keys/certs when keystore and trustore is provided] ***
changed: [kafka-connect-0] => (item=/var/ssl/private/ca.crt)
changed: [kafka-connect-0] => (item=/var/ssl/private/kafka_connect.crt)
changed: [kafka-connect-0] => (item=/var/ssl/private/kafka_connect.key)

TASK [Proceed Prompt] **********************************************************
skipping: [kafka-connect-0]

PLAY [Kafka Connect Serial Provisioning] ***************************************

TASK [confluent.platform.variables : Ensure old and new mTLS variables are consistent] ***
included: /home/runner/.ansible/collections/ansible_collections/confluent/platform/roles/variables/tasks/mtls.yml for kafka-connect-1

TASK [confluent.platform.variables : Define component SSL variable pairs] ******
ok: [kafka-connect-1]

....

# failure in worker-1

TASK [confluent.platform.common : Get Authorization Token] *********************
ok: [kafka-connect-1]

TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for single cluster] ***
skipping: [kafka-connect-1]

TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for Multiple Clusters] ***
skipping: [kafka-connect-1] => (item=behind_bastion) 
failed: [kafka-connect-1 -> kafka-connect-0(*** IP_connect0 ***)] (item=kafka_connect) => ***"ansible_loop_var": "item", "changed": false, "item": "kafka_connect", "message": "[Errno 2] No such file or directory", "msg": "An error occurred while running the module"***
failed: [kafka-connect-1 -> kafka-connect-0(*** IP_connect0 ***)] (item=kafka_connect_serial) => ***"ansible_loop_var": "item", "changed": false, "item": "kafka_connect_serial", "message": "[Errno 2] No such file or directory", "msg": "An error occurred while running the module"***
skipping: [kafka-connect-1] => (item=core) 

NO MORE HOSTS LEFT *************************************************************

The failure [Errno 2] No such file or directory occurs because the task Delete temporary keys/certs when keystore and trustore is provided has already run on worker-0.

Environment (please complete the following information):

  • OS: 5.4.0-198-generic #218-Ubuntu
  • CP-Ansible Branch: 7.9.2-post
  • Ansible Version: 9.13.0

Additional context

I'm planning to open a draft PR proposing a fix for the issue. We have validated that the playbook/role works once the connect subgroup functionality is removed. I would like to understand the subgroup functionality so that I can try to re-implement it.
