pager: fix panic when runtime is shut down #1456

wprzytula · 2025-10-25T07:36:57Z

Historically, when making changes to the code we would forget to send something to the Pager Worker channel before closing it. This would cause the recv().unwrap() in QueryPager to panic, because the channel would be closed without having sent anything.
To harden against such logic bugs, we added an abstraction of PageSendAttemptedProof, which enforces that at least one item is sent to the channel before it is closed, or else worker.work() can't return. With this abstraction in place, we long believed the unwrap() to be safe.

However, there is one more case when recv() can return None: when the runtime is being shut down. In that case, the task polling the channel is cancelled, and the channel is closed without sending anything.

This commit handles this case by returning an empty page when recv() returns None. This allows for graceful handling of runtime shutdown scenarios, without panicking.
We are sure that we won't introduce silent errors this way, because if we get None, the only possible explanation is that the runtime is indeed being shut down. The logic bugs on the side of the Pager Worker are already handled by the PageSendAttemptedProof abstraction.

Alternatively, we could return a new NextPageError variant indicating that the runtime is being shut down, but that seems like an overkill - this variant would be very special and pollute the API. Returning an empty page should be sufficient, as if someone is trying to shut down the runtime, they likely don't care about the results anyway.

Fixes: #1435

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
~~[ ] I added relevant tests for new features and bug fixes.~~
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
~~[ ] I have provided docstrings for the public items that I want to introduce.~~
~~[ ] I have adjusted the documentation in ./docs/source/.~~
I added appropriate Fixes: annotations to PR description.

Copilot

Pull Request Overview

This PR fixes a panic in the QueryPager when the Tokio runtime is shut down by replacing unwrap() with unwrap_or_else() to handle the case where the channel receiver returns None during runtime shutdown.

Key Changes:

Modified channel receive handling to return an empty page instead of panicking when the runtime shuts down
Added detailed inline comments explaining the rationale for this approach

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

scylla/src/client/pager.rs

github-actions · 2025-10-25T07:39:21Z

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: cf40086

Lorak-mmk · 2025-10-28T14:26:32Z

scylla/src/client/pager.rs

+        let page_received = receiver.recv().await.unwrap_or_else(|| {
+            // - The future returned by worker.work sends at least one item
+            //   to the channel (the PageSendAttemptedProof helps enforce this);
+            // - That future is polled in a tokio::task which isn't going to be
+            //   canceled, **unless** the runtime is being shut down.
+            // Therefore, the only reason for recv() to return None is
+            // that the runtime is being shut down. In that case,
+            // we return an empty page to allow for graceful handling
+            // of the situation.
+            //
+            // Alternatively, we could return a new NextPageError variant
+            // indicating that the runtime is being shut down, but that
+            // seems like an overkill - this variant would be very special
+            // and pollute the API. Returning an empty page should be sufficient,
+            // as if someone is trying to shut down the runtime, they probably
+            // don't care about the results anyway.
+            Ok(ReceivedPage {
+                rows: RawMetadataAndRawRows::mock_empty(),
+                tracing_id: None,
+                request_coordinator: None,
+            })
+        })?;


I don't like this solution. First, runtime shutdown is not the only way this can happen - the spawned task could also panic, in which case the correct behavior is to propagate the panic.
Second, this seems like hiding the issue, and also seems really brittle (we propagate an artificial value instead of handling the situation more explicitly). Right now we don't typecheck empty pages - but if we start at some point, it could lead to unexpected errors.

What I propose:

If we receive None, await the spawned task.

Check is_panic, and if it returns true, propagate it.

If it is not a panic, then it is shutdown (and is_cancelled will return true).

Add an util like assert_runtime_shutdown, which will for example await a never future (after printing a deubg / trace message). Use it in the above case.

First, runtime shutdown is not the only way this can happen - the spawned task could also panic, in which case the correct behavior is to propagate the panic.

I wasn't aware of that.

Second, this seems like hiding the issue, and also seems really brittle (we propagate an artificial value instead of handling the situation more explicitly). Right now we don't typecheck empty pages - but if we start at some point, it could lead to unexpected errors.

I can agree, but please note that it's actually the same with two other cases when we return an artificial empty page:

if issued a modification statement and got non-Rows Result in response,

if RetryPolicy decided to ignore the error.

Don't you think these cases involve the same problem that you claim here?

What I propose:

If we receive None, await the spawned task.

Check is_panic, and if it returns true, propagate it.

If it is not a panic, then it is shutdown (and is_cancelled will return true).

Add an util like assert_runtime_shutdown, which will for example await a never future (after printing a deubg / trace message). Use it in the above case.

I'll try to implement this idea.

wprzytula · 2025-10-28T20:52:58Z

Rebased on main.

Historically, when making changes to the code we would forget to send something to the Pager Worker channel before closing it. This would cause the `recv().unwrap()` in QueryPager to panic, because the channel would be closed without having sent anything. To harden against such logic bugs, we added an abstraction of PageSendAttemptedProof, which enforces that at least one item is sent to the channel before it is closed, or else `worker.work()` can't return. With this abstraction in place, we long believed the `unwrap()` to be safe. However, there is two more cases when `recv()` can return None: 1) when the runtime is being shut down; 2) when the worker task which is owner of the channel's sending part panics. In both cases, the worker task terminates, and the channel is closed without sending anything. This commit handles both cases by executing special recovery logic when `recv()` returns None. This allows for graceful handling of runtime shutdown scenarios, without panicking, as well as correct panic propagation. We are sure that we won't introduce silent errors this way, because if we get None, the only possible explanation is that the runtime is indeed being shut down or that the worker task panicked. The logic bugs on the side of the Pager Worker are already prevented by the PageSendAttemptedProof abstraction. If panic is detected, it is propagated using `std::panic::resume_unwind`. If runtime shutdown is detected, we await a never-ending future to avoid returning from this function while the runtime is being shut down. To help debugging, we also emit a tracing info-level message in this case. Fixes: scylladb#1435

wprzytula · 2025-10-29T06:58:19Z

Implemented @Lorak-mmk's idea to handle both panics and runtime shutdown.

wprzytula requested review from Lorak-mmk and Copilot October 25, 2025 07:37

wprzytula self-assigned this Oct 25, 2025

wprzytula added bug Something isn't working area/statement-execution labels Oct 25, 2025

Copilot AI reviewed Oct 25, 2025

View reviewed changes

scylla/src/client/pager.rs Outdated Show resolved Hide resolved

wprzytula added this to the 1.4.0 milestone Oct 25, 2025

Lorak-mmk requested changes Oct 28, 2025

View reviewed changes

wprzytula force-pushed the fix-pager-panic branch from dfdc71c to 56dcc28 Compare October 28, 2025 20:51

wprzytula force-pushed the fix-pager-panic branch from 56dcc28 to cf40086 Compare October 29, 2025 06:57

wprzytula requested a review from Lorak-mmk October 29, 2025 06:58

Lorak-mmk approved these changes Oct 29, 2025

View reviewed changes

wprzytula merged commit 0d56f2b into scylladb:main Oct 29, 2025
12 checks passed

wprzytula deleted the fix-pager-panic branch October 29, 2025 15:00

Lorak-mmk mentioned this pull request Nov 6, 2025

Release 1.4.0 #1467

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pager: fix panic when runtime is shut down #1456

pager: fix panic when runtime is shut down #1456

Uh oh!

wprzytula commented Oct 25, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

github-actions bot commented Oct 25, 2025 •

edited

Loading

Uh oh!

Lorak-mmk Oct 28, 2025

Uh oh!

wprzytula Oct 28, 2025 •

edited

Loading

Uh oh!

wprzytula commented Oct 28, 2025

Uh oh!

wprzytula commented Oct 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pager: fix panic when runtime is shut down #1456

pager: fix panic when runtime is shut down #1456

Uh oh!

Conversation

wprzytula commented Oct 25, 2025

Pre-review checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

github-actions bot commented Oct 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lorak-mmk Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

wprzytula Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wprzytula commented Oct 28, 2025

Uh oh!

wprzytula commented Oct 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Oct 25, 2025 •

edited

Loading

wprzytula Oct 28, 2025 •

edited

Loading