-
Notifications
You must be signed in to change notification settings - Fork 142
pager: fix panic when runtime is shut down #1456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a panic in the QueryPager when the Tokio runtime is shut down by replacing unwrap() with unwrap_or_else() to handle the case where the channel receiver returns None during runtime shutdown.
Key Changes:
- Modified channel receive handling to return an empty page instead of panicking when the runtime shuts down
- Added detailed inline comments explaining the rationale for this approach
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
scylla/src/client/pager.rs
Outdated
| let page_received = receiver.recv().await.unwrap_or_else(|| { | ||
| // - The future returned by worker.work sends at least one item | ||
| // to the channel (the PageSendAttemptedProof helps enforce this); | ||
| // - That future is polled in a tokio::task which isn't going to be | ||
| // canceled, **unless** the runtime is being shut down. | ||
| // Therefore, the only reason for recv() to return None is | ||
| // that the runtime is being shut down. In that case, | ||
| // we return an empty page to allow for graceful handling | ||
| // of the situation. | ||
| // | ||
| // Alternatively, we could return a new NextPageError variant | ||
| // indicating that the runtime is being shut down, but that | ||
| // seems like an overkill - this variant would be very special | ||
| // and pollute the API. Returning an empty page should be sufficient, | ||
| // as if someone is trying to shut down the runtime, they probably | ||
| // don't care about the results anyway. | ||
| Ok(ReceivedPage { | ||
| rows: RawMetadataAndRawRows::mock_empty(), | ||
| tracing_id: None, | ||
| request_coordinator: None, | ||
| }) | ||
| })?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this solution. First, runtime shutdown is not the only way this can happen - the spawned task could also panic, in which case the correct behavior is to propagate the panic.
Second, this seems like hiding the issue, and also seems really brittle (we propagate an artificial value instead of handling the situation more explicitly). Right now we don't typecheck empty pages - but if we start at some point, it could lead to unexpected errors.
What I propose:
- If we receive None, await the spawned task.
- Check
is_panic, and if it returns true, propagate it. - If it is not a panic, then it is shutdown (and
is_cancelledwill return true). - Add an util like
assert_runtime_shutdown, which will for example await a never future (after printing a deubg / trace message). Use it in the above case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, runtime shutdown is not the only way this can happen - the spawned task could also panic, in which case the correct behavior is to propagate the panic.
I wasn't aware of that.
Second, this seems like hiding the issue, and also seems really brittle (we propagate an artificial value instead of handling the situation more explicitly). Right now we don't typecheck empty pages - but if we start at some point, it could lead to unexpected errors.
I can agree, but please note that it's actually the same with two other cases when we return an artificial empty page:
- if issued a modification statement and got non-Rows Result in response,
- if RetryPolicy decided to ignore the error.
Don't you think these cases involve the same problem that you claim here?
What I propose:
If we receive None, await the spawned task.
Check
is_panic, and if it returns true, propagate it.If it is not a panic, then it is shutdown (and
is_cancelledwill return true).Add an util like
assert_runtime_shutdown, which will for example await a never future (after printing a deubg / trace message). Use it in the above case.
I'll try to implement this idea.
dfdc71c to
56dcc28
Compare
|
Rebased on main. |
Historically, when making changes to the code we would forget to send something to the Pager Worker channel before closing it. This would cause the `recv().unwrap()` in QueryPager to panic, because the channel would be closed without having sent anything. To harden against such logic bugs, we added an abstraction of PageSendAttemptedProof, which enforces that at least one item is sent to the channel before it is closed, or else `worker.work()` can't return. With this abstraction in place, we long believed the `unwrap()` to be safe. However, there is two more cases when `recv()` can return None: 1) when the runtime is being shut down; 2) when the worker task which is owner of the channel's sending part panics. In both cases, the worker task terminates, and the channel is closed without sending anything. This commit handles both cases by executing special recovery logic when `recv()` returns None. This allows for graceful handling of runtime shutdown scenarios, without panicking, as well as correct panic propagation. We are sure that we won't introduce silent errors this way, because if we get None, the only possible explanation is that the runtime is indeed being shut down or that the worker task panicked. The logic bugs on the side of the Pager Worker are already prevented by the PageSendAttemptedProof abstraction. If panic is detected, it is propagated using `std::panic::resume_unwind`. If runtime shutdown is detected, we await a never-ending future to avoid returning from this function while the runtime is being shut down. To help debugging, we also emit a tracing info-level message in this case. Fixes: scylladb#1435
56dcc28 to
cf40086
Compare
|
Implemented @Lorak-mmk's idea to handle both panics and runtime shutdown. |
Historically, when making changes to the code we would forget to send something to the Pager Worker channel before closing it. This would cause the
recv().unwrap()in QueryPager to panic, because the channel would be closed without having sent anything.To harden against such logic bugs, we added an abstraction of PageSendAttemptedProof, which enforces that at least one item is sent to the channel before it is closed, or else
worker.work()can't return. With this abstraction in place, we long believed theunwrap()to be safe.However, there is one more case when
recv()can return None: when the runtime is being shut down. In that case, the task polling the channel is cancelled, and the channel is closed without sending anything.This commit handles this case by returning an empty page when
recv()returns None. This allows for graceful handling of runtime shutdown scenarios, without panicking.We are sure that we won't introduce silent errors this way, because if we get None, the only possible explanation is that the runtime is indeed being shut down. The logic bugs on the side of the Pager Worker are already handled by the PageSendAttemptedProof abstraction.
Alternatively, we could return a new NextPageError variant indicating that the runtime is being shut down, but that seems like an overkill - this variant would be very special and pollute the API. Returning an empty page should be sufficient, as if someone is trying to shut down the runtime, they likely don't care about the results anyway.
Fixes: #1435
Pre-review checklist
[ ] I added relevant tests for new features and bug fixes.[ ] I have provided docstrings for the public items that I want to introduce.[ ] I have adjusted the documentation in./docs/source/.Fixes:annotations to PR description.