Skip to content

Conversation

georgesittas
Copy link
Contributor

@georgesittas georgesittas commented Jul 16, 2025

This PR aims to address the following problem: suppose a plan is submitted, resulting in a BigQuery job submission to evaluate a model, followed by a keyboard interrupt issued by the user (e.g., CTRL-C / SIGINT is sent from the terminal).

Today we don't cancel the job, meaning that long-running jobs will be left orphaned to run in BigQuery, despite the plan being cancelled. This may result in needless computation (i.e., cost), and so in this PR we intercept the keyboard interrupt keep track of pending jobs and send a cancellation request upon closing the adapter, unless the jobs are already completed by that point.

@georgesittas georgesittas requested a review from a team July 16, 2025 18:04
@georgesittas georgesittas marked this pull request as draft July 16, 2025 18:16
@georgesittas
Copy link
Contributor Author

Discussed internally and concluded there may be a better way to deal with this: aggregate initiated jobs in a set attribute, and upon close go over them and see if there are unfinished jobs. If there are, issue cancel requests for them. This way we only need to care about calling close downstream.

@georgesittas georgesittas force-pushed the jo/bigquery_job_cancellation branch from c2b6739 to 2a5b223 Compare July 18, 2025 11:23
@georgesittas georgesittas marked this pull request as ready for review July 18, 2025 11:23
@georgesittas georgesittas force-pushed the jo/bigquery_job_cancellation branch from bbc101a to 55a568d Compare July 22, 2025 15:01
@georgesittas georgesittas force-pushed the jo/bigquery_job_cancellation branch from 1083eed to f1809d8 Compare July 25, 2025 13:39
@georgesittas georgesittas requested a review from izeigerman July 25, 2025 13:39
@georgesittas
Copy link
Contributor Author

@izeigerman addressed the latest feedback round here:

  • Got rid of the _query_jobs property
  • Added proper type hint for the _query_job property

@georgesittas georgesittas force-pushed the jo/bigquery_job_cancellation branch 2 times, most recently from 8a0dd98 to 3e443a9 Compare July 28, 2025 10:39
@plaflamme
Copy link
Contributor

FMI: will this also be executed upon SIGTERM? We've had some cases of a sqlmesh GH bot being interrupted by Github Actions (which sends SIGTERM) that did not properly terminate the sqlmesh process. We'd like for GH actions cancellations to also gracefully stop the sqlmesh process and all BQ queries.

@georgesittas georgesittas force-pushed the jo/bigquery_job_cancellation branch from 3e443a9 to 76cd59a Compare August 11, 2025 08:57
@georgesittas georgesittas force-pushed the jo/bigquery_job_cancellation branch from 76cd59a to 88a3501 Compare August 12, 2025 14:18
@georgesittas
Copy link
Contributor Author

Hey @plaflamme 👋! Apologies for the delay, just got back from vacation.

FMI: will this also be executed upon SIGTERM? We've had some cases of a sqlmesh GH bot being interrupted by Github Actions (which sends SIGTERM) that did not properly terminate the sqlmesh process. We'd like for GH actions cancellations to also gracefully stop the sqlmesh process and all BQ queries.

I believe so:

  1. The default error handler always calls Context.close() upon termination of the invoked click sub-command (source)
  2. The SIGTERM signal terminates the process and hence invokes that close() call
  3. In turn, that leads to SnapshotEvaluator.close(), which closes all adapters held by the evaluator
  4. That leads to BigQueryEngineAdapter.close() being called, resulting in all pending jobs being cleaned up

Does this answer your question?

@georgesittas georgesittas requested a review from a team August 12, 2025 15:38
@plaflamme
Copy link
Contributor

@georgesittas Thanks for the reply!

The SIGTERM signal terminates the process and hence invokes that close() call

I'm not a python expert, but normally, this is only true if the application actually handles the signal. Perhaps that's already in place?

@georgesittas
Copy link
Contributor Author

Hmm, I'm actually not sure if the finally will be reached, maybe the signal needs to be caught in order to raise an exception instead of terminating the process. I'll double-check.

Perhaps that's already in place?

I don't think we register a handler for SIGTERM today.

@georgesittas
Copy link
Contributor Author

Ok yeah, it's not reached... Quick script to reproduce:

#!/usr/bin/env python3
"""
Usage:
1. Run this script: python sigterm_demo.py
2. In another terminal, find the PID printed by the script
3. Send SIGTERM: kill -TERM <PID>
"""

import os
import time

pid = os.getpid()
print(f"Process started with PID: {pid}")
print("Sleeping for 60 seconds...")

try:
    for i in range(60):
        print(f"Still running... {i+1}/60 seconds", end='\r')
        time.sleep(1)
finally:
    print("Reached!")

print("\nProcess completed normally (wasn't terminated)")

@izeigerman any thoughts re: handling SIGTERM? I wonder whether this is something we should do, or if a user should account for it. Perhaps we could register a handler for SIGTERM only for the CICD bot?

@plaflamme
Copy link
Contributor

@georgesittas

or if a user should account for it

FWIW: if an SCD2 query is running when SIGTERM occurs and is not cancelled, it's (currently at least) not trivial for the user to handle this since the table is mutated, but the interval not added. This results in "corruption" that is relatively difficult to remove manually; until SQLMesh correctly handles re-computing the latest interval for SCD2 models, it's pretty important to cancel outstanding queries.

@georgesittas
Copy link
Contributor Author

Perhaps we could register a SIGTERM handler that calls sys.exit(1) or something. That raises a SystemExit exception, signaling an intention to exit the interpreter (docs). That way, finally will be reached.

@georgesittas georgesittas merged commit d177625 into main Aug 13, 2025
25 of 27 checks passed
@georgesittas georgesittas deleted the jo/bigquery_job_cancellation branch August 13, 2025 07:19
@georgesittas
Copy link
Contributor Author

@plaflamme FYI I merged this without dealing with SIGTERM for the time being, since the two were orthogonal. It's on my list to tackle this soon, will let you know when I make some progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants