Skip to content

Conversation

@twz123
Copy link

@twz123 twz123 commented Oct 21, 2025

Three goroutines could outlive a call to ClientConn.close(). Add mechanics to cancel them and wait for them to complete when closing a client connection.

Fixes #8655.

RELEASE NOTES:

  • Closing a client connection will cancel all pending goroutines and block until they complete.

@codecov
Copy link

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.87%. Comparing base (647162c) to head (00deb21).

Files with missing lines Patch % Lines
clientconn.go 45.00% 10 Missing and 12 partials ⚠️
internal/testutils/pipe_listener.go 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8666      +/-   ##
==========================================
- Coverage   83.38%   82.87%   -0.52%     
==========================================
  Files         418      418              
  Lines       32367    32423      +56     
==========================================
- Hits        26988    26869     -119     
- Misses       4014     4048      +34     
- Partials     1365     1506     +141     
Files with missing lines Coverage Δ
balancer_wrapper.go 76.05% <100.00%> (-7.99%) ⬇️
internal/balancer/gracefulswitch/gracefulswitch.go 68.96% <100.00%> (-17.51%) ⬇️
internal/testutils/pipe_listener.go 54.76% <80.00%> (-29.03%) ⬇️
clientconn.go 79.35% <45.00%> (-11.16%) ⬇️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

healthData *healthData

shutdownMu sync.Mutex
shutdown chan struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change this to shutdownCh to indicate it is a channel?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


func (acbw *acBalancerWrapper) UpdateAddresses(addrs []resolver.Address) {
acbw.ac.updateAddrs(addrs)
acbw.goFunc(func(shutdown <-chan struct{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we changed this to run the acbw.ac.updateAddrs(addrs) function in a go routine??

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's basically a "bubbled-up" goroutine. Previously, the goroutine was spawned in updateAddrs itself (line 1021). But, as we now need to track those, I figured it would be most appropriate to do it here. Another option would be to somehow push this down into updateAddrs itself, by passing the acBalancerWrapper pointer, or a function pointer to acbw.goFunc or sth. along those lines, and then use that to spawn the goroutine there:

Suggested change
acbw.goFunc(func(shutdown <-chan struct{}) {
acbw.ac.updateAddrs(acbw, addrs)

Then we could write line 1021 of updateAddrs like so:

	acbw.goFunc(ac.resetTransportAndUnlock)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it this way. Seems to be clearer.

ac.mu.Lock()
defer ac.mu.Unlock()
acMu := &ac.mu
acMu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason the original code did not work? this seems unnecessarily complicated and does the same thing. Or am I missing something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a way to make the defer do the unlock conditionally, as the code might need to unlock it before returning (see lines 1441 and 1442). We can achieve the same e.g. with a boolean variable, if you prefer.

I think using a pointer for this is good, because, if you forget the nil check, it will panic with an easy to understand stack trace. Whereas if you forget to check a boolean, you'd do a double-unlock and there's a chance that you get weirder problems.

@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from e45e9d7 to e8bb2b4 Compare November 5, 2025 12:28
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from e8bb2b4 to 3fa606c Compare November 20, 2025 08:10
@easwars
Copy link
Contributor

easwars commented Dec 2, 2025

@twz123
We had a discussion about this PR during our GitHub issues/PR scrub. Can I request you to split these into separate PRs. The change to the graceful_switch LB policy is much simpler compared to the changes in the clientconn for the other two goroutine leaks. We want to make progress on this as much as possible, but are also a little concerned about the complexity that it adds to the clientconn code (which is already reasonably complex).

Apologies for how long this review has dragged on, but hopefully once these are split into smaller ones, we should be able to move faster.

Thanks for your understanding.

Goroutines spawned during balancer swaps could outlive the call to
Balancer.Close(). Monitor these via a wait group and wait for them to
finish before returning from Close(). This prevents any noticeable side
effects that could otherwise occur after Close() returns.

RELEASE NOTES:
- Closing a graceful switch balancer will now block until all pending
  goroutines complete.

Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>
Three goroutines could outlive a call to ClientConn.close(). Add
mechanics to cancel them and wait for them to complete when closing a
client connection.

RELEASE NOTES:
- Closing a client connection will cancel all pending goroutines and
  block until they complete.

Signed-off-by: Tom Wieczorek <twieczorek@mirantis.com>
@twz123 twz123 force-pushed the clientconn-close-waitfor-goroutines branch from 3fa606c to 00deb21 Compare December 5, 2025 11:12
@twz123
Copy link
Author

twz123 commented Dec 5, 2025

@twz123 We had a discussion about this PR during our GitHub issues/PR scrub. Can I request you to split these into separate PRs. The change to the graceful_switch LB policy is much simpler compared to the changes in the clientconn for the other two goroutine leaks. We want to make progress on this as much as possible, but are also a little concerned about the complexity that it adds to the clientconn code (which is already reasonably complex).

Done: #8746

Apologies for how long this review has dragged on, but hopefully once these are split into smaller ones, we should be able to move faster.

Thanks for your understanding.

No worries. I understand that these things take time.

@twz123 twz123 marked this pull request as draft December 5, 2025 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow waiting for all goroutines to exit on client connection close

4 participants