Description
Describe the issue
When deploying Kafka Connect with connectors (2 worker nodes) in OAuth mode against an HTTPS REST endpoint, the deployment fails because of missing certificates. The cause is that run_once and serial: 1 do not work together: the role tries to rerun the connector configuration updates on worker-0, which has already deleted its temporary certificates because it was processed in a separate serial batch.
More precisely, with our inventory the role takes the task branch Register connector configs and remove deleted connectors for Multiple Clusters, because it interprets the parent groups of the Connect cluster as subgroups and/or as some dynamic grouping. This task uses delegate_to, so in our case it runs on the worker whose certificates have already been removed, while the update REST call expects them to be available.
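The interaction between run_once and serial described above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical standalone playbook that is not taken from cp-ansible; only the kafka_connect group name comes from our inventory. Ansible evaluates run_once once per serial batch, so with serial: 1 the first task fires for every host, while the second task uses a guard on ansible_play_hosts_all to run once for the whole play.

```yaml
# Hypothetical playbook for illustration only; not part of cp-ansible.
- name: Illustrate run_once under serial batching
  hosts: kafka_connect
  serial: 1
  gather_facts: false
  tasks:
    # run_once is evaluated per batch, so with serial: 1 this fires
    # once for every host in the group.
    - name: Task marked run_once
      ansible.builtin.debug:
        msg: "run_once fired in the batch containing {{ inventory_hostname }}"
      run_once: true

    # ansible_play_hosts_all lists all hosts targeted by the play,
    # independent of batching, so this fires on the first host only.
    - name: Task guarded to run once per play
      ansible.builtin.debug:
        msg: "fired only on {{ ansible_play_hosts_all[0] }}"
      when: inventory_hostname == ansible_play_hosts_all[0]
```

Whether such a guard is suitable for the connector registration task is an open question, since that task also loops over groups and delegates to another host.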
To Reproduce
Steps to reproduce the behaviour:
- Run:
ansible-playbook -i hosts.yml --extra-vars "@extra-vars.yaml" --limit kafka_connect confluent.platform.all
Expected behaviour
The kafka_connect role is expected to register and deploy connectors once per Connect cluster in all deployment modes (parallel and serial).
Inventory File
Here are the relevant bits (i.e. the structure):
all:
  hosts:
    bastion:
      ...
  children:
    behind_bastion:
      vars:
        ....
      children:
        core:
          children:
            ...
            kafka_connect:
              hosts:
                kafka-connect-0:
                  ansible_host: "{{ IP_connect0 }}"
                kafka-connect-1:
                  ansible_host: "{{ IP_connect1 }}"
            ...
Logs
Relevant bits of the logs,
...
# successful first node deployment of configs on worker-0
TASK [confluent.platform.common : Get Authorization Token] *********************
ok: [kafka-connect-0]
TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for single cluster] ***
skipping: [kafka-connect-0]
TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for Multiple Clusters] ***
skipping: [kafka-connect-0] => (item=behind_bastion)
changed: [kafka-connect-0] => (item=kafka_connect)
changed: [kafka-connect-0] => (item=kafka_connect_serial)
skipping: [kafka-connect-0] => (item=core)
TASK [confluent.platform.kafka_connect : Delete temporary keys/certs when keystore and trustore is provided] ***
changed: [kafka-connect-0] => (item=/var/ssl/private/ca.crt)
changed: [kafka-connect-0] => (item=/var/ssl/private/kafka_connect.crt)
changed: [kafka-connect-0] => (item=/var/ssl/private/kafka_connect.key)
TASK [Proceed Prompt] **********************************************************
skipping: [kafka-connect-0]
PLAY [Kafka Connect Serial Provisioning] ***************************************
TASK [confluent.platform.variables : Ensure old and new mTLS variables are consistent] ***
included: /home/runner/.ansible/collections/ansible_collections/confluent/platform/roles/variables/tasks/mtls.yml for kafka-connect-1
TASK [confluent.platform.variables : Define component SSL variable pairs] ******
ok: [kafka-connect-1]
....
# failure in worker-1
TASK [confluent.platform.common : Get Authorization Token] *********************
ok: [kafka-connect-1]
TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for single cluster] ***
skipping: [kafka-connect-1]
TASK [confluent.platform.kafka_connect : Register connector configs and remove deleted connectors for Multiple Clusters] ***
skipping: [kafka-connect-1] => (item=behind_bastion)
failed: [kafka-connect-1 -> kafka-connect-0(*** IP_connect0 ***)] (item=kafka_connect) => ***"ansible_loop_var": "item", "changed": false, "item": "kafka_connect", "message": "[Errno 2] No such file or directory", "msg": "An error occurred while running the module"***
failed: [kafka-connect-1 -> kafka-connect-0(*** IP_connect0 ***)] (item=kafka_connect_serial) => ***"ansible_loop_var": "item", "changed": false, "item": "kafka_connect_serial", "message": "[Errno 2] No such file or directory", "msg": "An error occurred while running the module"***
skipping: [kafka-connect-1] => (item=core)
NO MORE HOSTS LEFT *************************************************************
The failure [Errno 2] No such file or directory occurs because the task Delete temporary keys/certs when keystore and trustore is provided has already run on worker-0.
Environment (please complete the following information):
- OS: 5.4.0-198-generic #218-Ubuntu
- CP-Ansible Branch: 7.9.2-post
- Ansible Version: 9.13.0
Additional context
I'm planning to open a draft PR proposing a fix. We have validated that the playbook/role works once the Connect subgroup functionality is removed. I would like to understand what the subgroup functionality is meant to do, so that I can try to re-implement it in the fix.