Skip to content

Conversation

@jimmidyson
Copy link
Member

@jimmidyson jimmidyson commented Nov 27, 2025

What this PR does / why we need it:

Added a new dynamicTLSRoundtripper to allow for context-specific TLS configurations in HTTP requests. This prevents overwriting the global http.DefaultClient's TLS settings, addressing potential race conditions in concurrent calls. The httpClient now utilizes this new roundtripper to manage TLS settings effectively without creating a new http.Client every call and therefore still benefitting from http.Client optimization around connection reuse, etc.

Follow up from #13058 (comment).

Description from the original PR that this effectively would supersede:

The current TLS configuration was overriding the TLSConfig for the global http.DefaultClient. This call is being used by controllers such as the ExtensionConfig controller which calls this function from multiple concurrent workers. This leads to a race where the TLS ServerName is configured differently to that of the URL it is trying to call and X509 validation fails. An example can be seen from the CAPI logs below:

"Reconciler error" err="failed to discover ExtensionConfig extension-config-a: failed to discover extension \"extension-config-a\": http call failed: Post \"[https://extension-config-a-runtimehooks.extension-config-a-system.svc:443/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovery?timeout=10s](https://extension-config-a-runtimehooks.extension-config-a-system.svc/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovery?timeout=10s)\": tls: failed to verify certificate: x509: certificate is valid for extension-config-a-runtimehooks.extension-config-a-system.svc, extension-config-a-runtimehooks.extension-config-a-system.svc.cluster.local, not extension-config-b.extension-config-b-system.svc" controller="extensionconfig" controllerGroup="runtime.cluster.x-k8s.io" controllerKind="ExtensionConfig" ExtensionConfig="extension-config-a" namespace="" name="extension-config-a" reconcileID="dfd00b69-3666-4818-b4a0-52eb1c391848"
"Reconciler error" err="failed to discover ExtensionConfig extension-config-b: failed to discover extension \"extension-config-b\": http call failed: Post \"[https://extension-config-b.extension-config-b-system.svc:443/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovery?timeout=10s](https://extension-config-b.extension-config-b-system.svc/hooks.runtime.cluster.x-k8s.io/v1alpha1/discovery?timeout=10s)\": tls: failed to verify certificate: x509: certificate is valid for extension-config-b.extension-config-b-system.svc, extension-config-b.extension-config-b-system.svc.cluster.local, not extension-config-a-runtimehooks.extension-config-a-system.svc" controller="extensionconfig" controllerGroup="runtime.cluster.x-k8s.io" controllerKind="ExtensionConfig" ExtensionConfig="extension-config-b" namespace="" name="extension-config-b" reconcileID="4cc93a96-cfcf-49f8-8276-b3725fc8e1b8"

Notice how the URL and the expected hostname are swapped in each log indicating a race (TLSConfig being reconfigured in the middle of the call by different worker threads).

When this occurs the ExtensionConfigs become unreconciled and therefore ClusterClasses that reference them also become unreconcilable. Sorry I don't have resource histories showing this but the log I shared in the OP would trigger this.

We're hitting this with just 2 ExtensionConfigs and luckily for us so far this has recovered pretty quickly on requeue and thanks to jitter etc this does not materially affect cluster deployments. However, at scale with more ExtensionConfigs this race will be hit more often and could lead to worse outcomes.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Nov 27, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 27, 2025
@jimmidyson
Copy link
Member Author

/area clusterclass
/assign @fabriziopandini

@k8s-ci-robot k8s-ci-robot added area/clusterclass Issues or PRs related to clusterclass and removed do-not-merge/needs-area PR is missing an area label labels Nov 27, 2025
@jimmidyson jimmidyson force-pushed the jimmi/dynamic-transport-for-client branch from 10e52dc to 6b4fb4b Compare November 27, 2025 19:01
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from fabriziopandini. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fabriziopandini
Copy link
Member

/hold

while we think about the best solution for this issue

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2025
Added a new dynamicTLSRoundtripper to allow for context-specific TLS configurations in HTTP requests. This prevents overwriting the global http.DefaultClient's TLS settings, addressing potential race conditions in concurrent calls. The httpClient now utilizes this new roundtripper to manage TLS settings effectively without creating a new http.Client every call and therefore still benefitting from http.Client optimization around connection reuse, etc.
@jimmidyson jimmidyson force-pushed the jimmi/dynamic-transport-for-client branch from 6b4fb4b to dd0fc0d Compare November 28, 2025 10:37
@chrischdi
Copy link
Member

I had a chat with Fabrizio about this and I think we have a slightly different solution which caches and re-uses clients per target for some time.

However, thanks @jimmidyson for bringing up and reporting this bug!

We will follow-up on Fabrizio's PR then which should be up in a bit, and take care of cherry-picking that one too.

@fabriziopandini
Copy link
Member

@jimmidyson @chrischdi the new PR is up #13075

@jimmidyson jimmidyson closed this Nov 28, 2025
@jimmidyson jimmidyson deleted the jimmi/dynamic-transport-for-client branch November 28, 2025 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/clusterclass Issues or PRs related to clusterclass cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants