Description
During node drain, every pgwire session’s context is canceled once server.shutdown.query_wait expires. In pkg/sql/pgwire/server.go (Server.serveImpl), we currently exit the client message loop only when:
- the network read returns a closed-connection error
- the returned err is (or maps to) context.Canceled
- repeatedErrorCount exceeds maxRepeatedErrorCount (currently 256, but 32 768 on older builds)
Note that the second case doesn’t check the context directly. It depends on a called function surfacing a context.Canceled error.
If a client keeps sending invalid SQL and we continue reading new requests successfully, the loop stays in the repeated-error path until maxRepeatedErrorCount is reached, even if the session’s context has already been canceled by drain. As a result, drain may log "some sessions did not respond to cancellation within 1s" and fail.
We should explicitly check for cancellation in ctx inside the loop. Two possible approaches:
- Check ctx.Err() at the end of each iteration and break immediately if non-nil.
- When incrementing repeatedErrorCount, also check ctx.Err() and exit early if canceled.
Jira issue: CRDB-57157