[flink/ci] Add watchdog for flink timeout failure stack trace at 95 percent time completion #1280

Open
wants to merge 1 commit into
base: main
from

MehulBatra
Contributor

@MehulBatra MehulBatra commented Jul 6, 2025

Purpose

Linked issue: close #1031

Add watchdog monitoring for Flink tests to capture debugging information when CI builds time out, helping identify the root causes of stuck test processes.

Brief change log

  • Added tools/ci/flink_watchdog.sh: New watchdog script that monitors Flink test execution and captures thread dumps when tests approach timeout limits

  • Modified .github/workflows/ci.yaml: Updated CI workflow to use watchdog specifically for Flink module tests (matrix.module == 'flink')

  • Enhanced debugging capabilities: Watchdog captures Java process information and thread dumps (jstack) at 95% of timeout and just before process termination

  • Artifact collection: Added new CI step to upload debug artifacts (thread dumps, execution logs) when Flink tests fail due to timeout
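
For reviewers who want a feel for the mechanism, here is a minimal sketch of how such a watchdog could work. It is not the contents of tools/ci/flink_watchdog.sh: the function names `run_with_watchdog` and `dump_java_threads`, the one-second polling interval, and the exit status 124 (borrowed from GNU `timeout`) are assumptions made for illustration. The sketch also writes raw output to test-output without per-line timestamps and only signals the direct child process.

```shell
# Illustrative sketch only; not the contents of tools/ci/flink_watchdog.sh.

# dump_java_threads FILE: list running JVMs and write a jstack dump for each.
dump_java_threads() {
  {
    echo "=== thread dump at $(date) ==="
    if command -v jps >/dev/null 2>&1 && command -v jstack >/dev/null 2>&1; then
      jps -l
      for pid in $(jps -q); do
        echo "--- jstack $pid ---"
        jstack "$pid" || true   # a JVM may exit between jps and jstack
      done
    else
      echo "jps/jstack not on PATH; no thread dump taken"
    fi
  } >"$1" 2>&1
}

# run_with_watchdog TIMEOUT_SECONDS OUT_DIR COMMAND [ARGS...]
# Runs COMMAND in the background, dumps threads at 95% of the timeout,
# dumps again at the timeout, then terminates COMMAND.
run_with_watchdog() {
  local timeout out_dir warn_at start elapsed warned cmd_pid
  timeout="$1"
  out_dir="$2"
  shift 2
  mkdir -p "$out_dir"
  warn_at=$(( timeout * 95 / 100 ))   # 95% mark, rounded down

  "$@" >"$out_dir/test-output" 2>&1 &
  cmd_pid=$!
  start=$(date +%s)
  warned=0

  while kill -0 "$cmd_pid" 2>/dev/null; do
    elapsed=$(( $(date +%s) - start ))
    if [ "$warned" -eq 0 ] && [ "$elapsed" -ge "$warn_at" ]; then
      dump_java_threads "$out_dir/jps-traces.0"
      warned=1
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      dump_java_threads "$out_dir/jps-traces.1"
      kill "$cmd_pid" 2>/dev/null
      wait "$cmd_pid" 2>/dev/null
      return 124   # assumed convention: same status GNU timeout uses
    fi
    sleep 1
  done
  wait "$cmd_pid"   # command finished in time: propagate its exit status
}
```

A local dry run with a simulated hang could look like `run_with_watchdog 20 /tmp/flink-debug sleep 60`, which should leave test-output, jps-traces.0 and jps-traces.1 in /tmp/flink-debug.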

This introduces a new CI debugging feature for Flink tests. Key aspects:

Debug artifacts: When Flink tests time out, three files are created and uploaded:

  1. test-output: Complete timestamped execution log
  2. jps-traces.0: Thread dump captured at 95% of the timeout (shows stuck processes)
  3. jps-traces.1: Final thread dump before process termination
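
As a rough aid for reading these dumps, the helper below counts threads per state in a jstack dump. It relies only on the `java.lang.Thread.State:` lines that jstack prints for each thread; the function name `summarize_thread_states` is hypothetical and not part of this PR.

```shell
# summarize_thread_states FILE: print "<STATE> <count>" for each
# java.lang.Thread.State value found in a jstack dump, sorted by state name.
summarize_thread_states() {
  grep -o 'java.lang.Thread.State: [A-Z_]*' "$1" \
    | awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' \
    | sort
}
```

Running it on both jps-traces.0 and jps-traces.1 and comparing the counts can hint at whether the same threads stayed blocked or waiting between the two captures.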

Artifact download: Failed Flink CI runs will have downloadable flink-debug-{run-number} artifacts containing all debugging information

Tests

  • Local testing: Watchdog script tested locally with simulated timeout scenarios using sleep commands

API and Format

None

Documentation

None

@MehulBatra
Contributor Author

@wuchong, I have tried to implement it in the most basic way to get started and kept it specifically for Flink tests. Please review whenever you get a chance. Thank you!

Development

Successfully merging this pull request may close these issues.

Print and upload stacktraces before test is timeout