Skip to content

Conversation

@wprzytula
Copy link
Collaborator

Historically, when making changes to the code we would forget to send something to the Pager Worker channel before closing it. This would cause the recv().unwrap() in QueryPager to panic, because the channel would be closed without having sent anything.
To harden against such logic bugs, we added an abstraction of PageSendAttemptedProof, which enforces that at least one item is sent to the channel before it is closed, or else worker.work() can't return. With this abstraction in place, we long believed the unwrap() to be safe.

However, there is one more case when recv() can return None: when the runtime is being shut down. In that case, the task polling the channel is cancelled, and the channel is closed without sending anything.

This commit handles this case by returning an empty page when recv() returns None. This allows for graceful handling of runtime shutdown scenarios, without panicking.
We are sure that we won't introduce silent errors this way, because if we get None, the only possible explanation is that the runtime is indeed being shut down. The logic bugs on the side of the Pager Worker are already handled by the PageSendAttemptedProof abstraction.

Alternatively, we could return a new NextPageError variant indicating that the runtime is being shut down, but that seems like an overkill - this variant would be very special and pollute the API. Returning an empty page should be sufficient, as if someone is trying to shut down the runtime, they likely don't care about the results anyway.

Fixes: #1435

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • [ ] I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula self-assigned this Oct 25, 2025
@wprzytula wprzytula added bug Something isn't working area/statement-execution labels Oct 25, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a panic in the QueryPager when the Tokio runtime is shut down by replacing unwrap() with unwrap_or_else() to handle the case where the channel receiver returns None during runtime shutdown.

Key Changes:

  • Modified channel receive handling to return an empty page instead of panicking when the runtime shuts down
  • Added detailed inline comments explaining the rationale for this approach

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wprzytula wprzytula added this to the 1.4.0 milestone Oct 25, 2025
@github-actions
Copy link

github-actions bot commented Oct 25, 2025

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: cf40086

Comment on lines 1004 to 1025
let page_received = receiver.recv().await.unwrap_or_else(|| {
// - The future returned by worker.work sends at least one item
// to the channel (the PageSendAttemptedProof helps enforce this);
// - That future is polled in a tokio::task which isn't going to be
// canceled, **unless** the runtime is being shut down.
// Therefore, the only reason for recv() to return None is
// that the runtime is being shut down. In that case,
// we return an empty page to allow for graceful handling
// of the situation.
//
// Alternatively, we could return a new NextPageError variant
// indicating that the runtime is being shut down, but that
// seems like an overkill - this variant would be very special
// and pollute the API. Returning an empty page should be sufficient,
// as if someone is trying to shut down the runtime, they probably
// don't care about the results anyway.
Ok(ReceivedPage {
rows: RawMetadataAndRawRows::mock_empty(),
tracing_id: None,
request_coordinator: None,
})
})?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this solution. First, runtime shutdown is not the only way this can happen - the spawned task could also panic, in which case the correct behavior is to propagate the panic.
Second, this seems like hiding the issue, and also seems really brittle (we propagate an artificial value instead of handling the situation more explicitly). Right now we don't typecheck empty pages - but if we start at some point, it could lead to unexpected errors.

What I propose:

  1. If we receive None, await the spawned task.
  2. Check is_panic, and if it returns true, propagate it.
  3. If it is not a panic, then it is shutdown (and is_cancelled will return true).
  4. Add an util like assert_runtime_shutdown, which will for example await a never future (after printing a deubg / trace message). Use it in the above case.

Copy link
Collaborator Author

@wprzytula wprzytula Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, runtime shutdown is not the only way this can happen - the spawned task could also panic, in which case the correct behavior is to propagate the panic.

I wasn't aware of that.

Second, this seems like hiding the issue, and also seems really brittle (we propagate an artificial value instead of handling the situation more explicitly). Right now we don't typecheck empty pages - but if we start at some point, it could lead to unexpected errors.

I can agree, but please note that it's actually the same with two other cases when we return an artificial empty page:

  1. if issued a modification statement and got non-Rows Result in response,
  2. if RetryPolicy decided to ignore the error.

Don't you think these cases involve the same problem that you claim here?

What I propose:

  1. If we receive None, await the spawned task.

  2. Check is_panic, and if it returns true, propagate it.

  3. If it is not a panic, then it is shutdown (and is_cancelled will return true).

  4. Add an util like assert_runtime_shutdown, which will for example await a never future (after printing a deubg / trace message). Use it in the above case.

I'll try to implement this idea.

@wprzytula
Copy link
Collaborator Author

Rebased on main.

Historically, when making changes to the code we would forget to send
something to the Pager Worker channel before closing it. This would
cause the `recv().unwrap()` in QueryPager to panic, because the channel
would be closed without having sent anything.
To harden against such logic bugs, we added an abstraction of
PageSendAttemptedProof, which enforces that at least one item is sent
to the channel before it is closed, or else `worker.work()` can't
return. With this abstraction in place, we long believed the `unwrap()`
to be safe.

However, there is two more cases when `recv()` can return None:
1) when the runtime is being shut down;
2) when the worker task which is owner of the channel's sending part
   panics.
In both cases, the worker task terminates, and the channel is closed
without sending anything.

This commit handles both cases by executing special recovery logic
when `recv()` returns None. This allows for graceful handling of runtime
shutdown scenarios, without panicking, as well as correct panic
propagation.
We are sure that we won't introduce silent errors this way, because
if we get None, the only possible explanation is that the runtime
is indeed being shut down or that the worker task panicked. The logic
bugs on the side of the Pager Worker are already prevented by the
PageSendAttemptedProof abstraction.

If panic is detected, it is propagated using `std::panic::resume_unwind`.
If runtime shutdown is detected, we await a never-ending future
to avoid returning from this function while the runtime is being
shut down. To help debugging, we also emit a tracing info-level message
in this case.

Fixes: scylladb#1435
@wprzytula
Copy link
Collaborator Author

Implemented @Lorak-mmk's idea to handle both panics and runtime shutdown.

@wprzytula wprzytula requested a review from Lorak-mmk October 29, 2025 06:58
@wprzytula wprzytula merged commit 0d56f2b into scylladb:main Oct 29, 2025
12 checks passed
@wprzytula wprzytula deleted the fix-pager-panic branch October 29, 2025 15:00
@Lorak-mmk Lorak-mmk mentioned this pull request Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/statement-execution bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

QueryPager panics due to no page send attempted, despite expected type-level guarantees

2 participants