
@varundeepsaini
Contributor

@varundeepsaini varundeepsaini commented Dec 15, 2025

Closes: #4140

Changes

Add a KillCaller option to the test server that terminates the CLI process when a matching endpoint is hit. This enables testing crash recovery scenarios, such as a bundle deploy interrupted mid-execution.

KillCaller specifies how many calls to kill before the server starts responding normally, which enables retry and recovery testing.

[[Server]]
Pattern = "GET /api/2.0/workspace/list"
KillCaller = 2  # Kill first 2 calls, then allow
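
To make the counter semantics concrete, here is a minimal Go sketch of a handler wrapper that kills the first N calls and then serves normally. It is not the code from this PR: `killWhen`, `simulate`, and the injected `decide` and `kill` functions are hypothetical names, and `kill` only counts invocations here so the sketch can run without terminating a real process.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// killWhen wraps next: when decide() reports true, kill is invoked and the
// request is not served. Both functions are injected (hypothetical names) so
// the sketch runs without terminating a real process.
func killWhen(decide func() bool, kill func(*http.Request), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if decide() {
			kill(r)
			return // the caller is expected to be gone; nothing is written
		}
		next.ServeHTTP(w, r)
	})
}

// simulate models KillCaller = n by sending `calls` sequential requests and
// reporting how many were killed and which ones were actually served.
// Requests are sequential here, so a plain int counter is enough; a server
// handling concurrent requests would need to synchronize it.
func simulate(n, calls int) (kills int, served []bool) {
	remaining := n
	decide := func() bool {
		if remaining > 0 {
			remaining--
			return true
		}
		return false
	}

	var servedNow bool
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		servedNow = true
		w.WriteHeader(http.StatusOK)
	})

	h := killWhen(decide, func(*http.Request) { kills++ }, ok)
	for i := 0; i < calls; i++ {
		servedNow = false
		req := httptest.NewRequest(http.MethodGet, "/api/2.0/workspace/list", nil)
		h.ServeHTTP(httptest.NewRecorder(), req)
		served = append(served, servedNow)
	}
	return kills, served
}
```

With `simulate(2, 4)`, the first two calls go to `kill` and the last two reach the wrapped handler, which mirrors `KillCaller = 2` in the config above.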

Why

Enables testing CLI crash recovery scenarios (e.g., bundle deploy interrupted mid-execution).

Tests

Added acceptance/selftest/kill_caller/ tests.

@varundeepsaini varundeepsaini force-pushed the ft/server-crash-acceptance branch from 89e1011 to 3a3b6f5 Compare December 15, 2025 15:20
@varundeepsaini
Contributor Author

@denik could you please review this PR?

@varundeepsaini varundeepsaini force-pushed the ft/server-crash-acceptance branch 2 times, most recently from debc6c8 to bcc288c Compare December 17, 2025 08:02
@varundeepsaini varundeepsaini force-pushed the ft/server-crash-acceptance branch from bcc288c to ba053c3 Compare December 17, 2025 08:04
@varundeepsaini
Contributor Author

varundeepsaini commented Dec 17, 2025

@denik can you please queue the integration tests?

@varundeepsaini
Contributor Author

@denik bump ^^

@eng-dev-ecosystem-bot
Collaborator

eng-dev-ecosystem-bot commented Dec 30, 2025

Commit: 1ad08e2

Run: 20655194881

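| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 🟨 aws linux | 9 | | 8 | 2 | 381 | 663 | 23:21 |
| 🟨 aws windows | 14 | | 3 | 2 | 383 | 661 | 24:34 |
| 🟨 aws-ucws linux | 5 | | 12 | 2 | 532 | 540 | 39:35 |
| 🟨 aws-ucws windows | 10 | | 7 | 2 | 534 | 538 | 37:47 |
| 🟨 azure linux | 4 | 2 | 6 | 3 | 381 | 662 | 91:35 |
| 🟨 azure windows | 9 | 2 | 1 | 3 | 383 | 660 | 92:17 |
| 🟨 azure-ucws linux | 7 | | 5 | 3 | 528 | 539 | 241:56 |
| 🟨 azure-ucws windows | 12 | | | 3 | 530 | 537 | 224:00 |
| 🟨 gcp linux | 7 | | 5 | 3 | 370 | 668 | 31:32 |
| 🟨 gcp windows | 12 | | | 3 | 372 | 666 | 27:40 |
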
19 interesting tests: 18 KNOWN, 1 SKIP

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🟨 TestAccept/bundle/deployment/bind/alert | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🟨 TestAccept/bundle/resources/alerts/basic | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🟨 TestAccept/bundle/resources/alerts/basic/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | 🔄 f | 🔄 f | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🟨 TestAccept/bundle/resources/alerts/basic/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | 🔄 f | 🔄 f | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 🟨 K | 🟨 K | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestExport | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🟨 TestExportWithFileFlag | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🟨 TestImportDir | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K |
| 🟨 TestImportDirDoesNotOverwrite | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K |
| 🟨 TestImportDirWithOverwriteFlag | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K |
| 🟨 TestImportFileFormatAuto | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K |
| 🟨 TestImportFileFormatSource | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K | 💚 R | 🟨 K |

Top 28 slowest tests (at least 2 minutes):
duration env testname
10:06 azure windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
8:01 gcp linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
7:17 aws-ucws windows TestAccept/bundle/resources/synced_database_tables/basic
6:21 azure linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:40 aws-ucws linux TestAccept/bundle/resources/synced_database_tables/basic
5:29 aws-ucws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:25 aws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:24 aws-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:22 gcp windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:19 aws-ucws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
5:10 gcp windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
4:38 azure linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
4:18 azure windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
3:50 azure-ucws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
3:48 azure-ucws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
3:44 aws-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
3:33 azure-ucws windows TestAccept/bundle/resources/synced_database_tables/basic
2:59 gcp linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
2:45 aws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
2:43 azure-ucws linux TestAccept/bundle/resources/synced_database_tables/basic
2:35 aws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
2:32 aws-ucws windows TestAccept/bundle/resources/registered_models/basic/DATABRICKS_BUNDLE_ENGINE=terraform
2:28 aws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
2:16 aws-ucws linux TestAccept/bundle/templates/default-python/combinations/classic/DATABRICKS_BUNDLE_ENGINE=direct/DLT=yes/NBOOK=no/PY=no/READPLAN=
2:11 azure-ucws windows TestAccept/bundle/resources/registered_models/basic/DATABRICKS_BUNDLE_ENGINE=terraform
2:10 azure-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
2:09 azure-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
2:06 aws linux TestAccept/bundle/templates/default-python/combinations/classic/DATABRICKS_BUNDLE_ENGINE=terraform/DLT=no/NBOOK=yes/PY=no/READPLAN=

Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
@varundeepsaini
Contributor Author

@denik
I changed KillCaller from bool to int to support retry scenarios. Now, we can specify how many times to kill before allowing requests through (e.g., KillCaller = 3 kills the first 3 attempts, 4th succeeds).

I have two questions:

  1. Should I add thread safety (via atomics)?
  2. Would a binary string be useful for granular control (e.g., "1011" = kill, allow, kill, kill)?
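
For illustration, the binary-string idea in question 2 could be read as follows. This is a sketch under assumed semantics that the thread does not spell out: the i-th request is killed when the i-th character is '1', and requests past the end of the pattern are allowed. `killPattern` and `counterAsPattern` are hypothetical names.

```go
package main

import "strings"

// killPattern interprets a string such as "1011": the i-th request is killed
// when the i-th character is '1'. Requests past the end of the pattern are
// allowed. These semantics are an assumption, not something the PR defines.
// The type is not safe for concurrent use.
type killPattern struct {
	pattern string
	next    int
}

// ShouldKill consumes one position of the pattern and reports whether the
// current request should be killed.
func (p *killPattern) ShouldKill() bool {
	if p.next >= len(p.pattern) {
		return false
	}
	kill := p.pattern[p.next] == '1'
	p.next++
	return kill
}

// counterAsPattern shows that the integer form KillCaller = n is the special
// case of a pattern made of n ones.
func counterAsPattern(n int) *killPattern {
	return &killPattern{pattern: strings.Repeat("1", n)}
}
```

Under this reading, the integer counter covers the common "fail N times, then succeed" case, and the pattern form only adds value when kills and successes need to interleave.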

Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com>
@denik
Contributor

denik commented Jan 2, 2026

> @denik I changed KillCaller from bool to int to support retry scenarios. Now, we can specify how many times to kill before allowing requests through (e.g., KillCaller = 3 kills the first 3 attempts, 4th succeeds).

That's a good idea.

> 1. Should I add thread safety (via atomics)?

Yes, a simple mutex would work.

> 2. Would a binary string be useful for granular control (e.g., "1011" = kill, allow, kill, kill)?

I think we can start with the counter and see how far we can take it.
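
A mutex-guarded counter along the lines suggested here might look like the sketch below. It is not the PR's implementation: `killCounter`, `ShouldKill`, and `countKills` are hypothetical names, and `countKills` exists only to show that concurrent callers consume the budget exactly once each.

```go
package main

import "sync"

// killCounter tracks how many more calls should be killed. The mutex makes
// the check-and-decrement a single step, so concurrent handlers cannot both
// consume the same unit of the budget.
type killCounter struct {
	mu        sync.Mutex
	remaining int
}

// ShouldKill reports whether the current call should be killed and, if so,
// uses up one unit of the remaining budget.
func (c *killCounter) ShouldKill() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.remaining > 0 {
		c.remaining--
		return true
	}
	return false
}

// countKills calls ShouldKill from `callers` goroutines at once and returns
// how many of them were told to kill.
func countKills(c *killCounter, callers int) int {
	var wg sync.WaitGroup
	results := make([]bool, callers) // each goroutine writes only its own slot
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = c.ShouldKill()
		}(i)
	}
	wg.Wait()

	kills := 0
	for _, killed := range results {
		if killed {
			kills++
		}
	}
	return kills
}
```

With a budget of 3 and 50 concurrent callers, exactly 3 calls report a kill, regardless of scheduling order.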

I was thinking of a more general request matcher that matches requests based on a payload substring. Then we wouldn't need a predefined sequence, and it would also be more reliable in case of unexpected retries.
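
As a rough illustration of the payload-matching idea (which is left for a separate feature), a substring check on the request body could be sketched like this. `bodyContains` and `newPost` are hypothetical helpers, and the request path is a placeholder, not an endpoint taken from this PR.

```go
package main

import (
	"bytes"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// bodyContains reports whether the request body contains substr. It reads the
// whole body and then restores r.Body, so a downstream handler can still read
// the payload after the match.
func bodyContains(r *http.Request, substr string) (bool, error) {
	data, err := io.ReadAll(r.Body)
	if err != nil {
		return false, err
	}
	r.Body = io.NopCloser(bytes.NewReader(data))
	return bytes.Contains(data, []byte(substr)), nil
}

// newPost builds an in-memory POST request for trying the matcher out.
// The path is a placeholder chosen for this sketch.
func newPost(body string) *http.Request {
	return httptest.NewRequest(http.MethodPost, "/hypothetical/endpoint", strings.NewReader(body))
}
```

Because the decision depends on what the request carries rather than on its position in a sequence, an unexpected retry of an earlier request would not shift which call gets killed.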

@varundeepsaini
Copy link
Contributor Author

@denik I can add another field for a regex string and kill every request whose body matches that regex. Does that fit the idea you had? (I would still like to keep the int field as well, for easier use.)

@denik
Contributor

denik commented Jan 2, 2026

> @denik I can add another field for the regex string, and kill all the requests which have the body matching that regex does that work with what idea you had ? (i would still like to keep the int field as well, for easier use)

In this PR, let's keep your counter approach. Body matching is just an idea for a separate feature.

@github-actions

github-actions bot commented Jan 2, 2026

An authorized user can trigger integration tests manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 4146
  • Commit SHA: e0d41502877f7abdccdf450473ed24d9a5f37b53

Checks will be approved automatically on success.

@varundeepsaini varundeepsaini requested a review from denik January 2, 2026 11:29

Development

Successfully merging this pull request may close these issues.

Add ability to kill CLI process from test server to test crash recovery scenarios

3 participants