🌱 Implement dynamic TLS configuration for HTTP client #13064
Closed
+34
−8
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What this PR does / why we need it:
Added a new dynamicTLSRoundtripper to allow for context-specific TLS configurations in HTTP requests. This prevents overwriting the global http.DefaultClient's TLS settings, addressing potential race conditions in concurrent calls. The httpClient now utilizes this new roundtripper to manage TLS settings effectively without creating a new http.Client every call and therefore still benefitting from http.Client optimization around connection reuse, etc.
Follow up from #13058 (comment).
Description from the original PR that this effectively would supersede:
The current TLS configuration was overriding the TLSConfig for the global http.DefaultClient. This call is being used by controllers such as the ExtensionConfig controller which calls this function from multiple concurrent workers. This leads to a race where the TLS ServerName is configured differently to that of the URL it is trying to call and X509 validation fails. An example can be seen from the CAPI logs below:
Notice how the URL and the expected hostname are swapped in each log indicating a race (TLSConfig being reconfigured in the middle of the call by different worker threads).
When this occurs the ExtensionConfigs become unreconciled and therefore ClusterClasses that reference them also become unreconcilable. Sorry I don't have resource histories showing this but the log I shared in the OP would trigger this.
We're hitting this with just 2 ExtensionConfigs and luckily for us so far this has recovered pretty quickly on requeue and thanks to jitter etc this does not materially affect cluster deployments. However, at scale with more ExtensionConfigs this race will be hit more often and could lead to worse outcomes.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Fixes #