@Zakaria-Kofiro Zakaria-Kofiro commented Dec 19, 2025

Introduce VMStatus.replaced for AgentWatchdog-replaced agents

Previously, when the AgentWatchdog replaced a failed agent, the failed agent was marked as terminated and treated the same as an intentionally stopped agent. This broke bulk Kill/Stop commands (which tried to reach agents that no longer existed), produced incorrect job status calculations (jobs stuck in "Starting" or prematurely marked "Completed"), and inflated agent counts in the UI (e.g., "50/54" instead of "50/50").

The new VMStatus.replaced state keeps failed agents visible for debugging while explicitly excluding them from active job counts and command execution targets.
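As a rough illustration of the exclusion described above, here is a minimal Java sketch. It is not the code from this PR: the `Agent` class, the `running` status, and the helper names `commandTargets` and `countLabel` are hypothetical, and the sample data assumes the "50/54" example means 50 running agents plus 4 replaced ones.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplacedStatusSketch {
    // pending, starting, terminated and replaced are named in the PR text;
    // `running` is a placeholder added for this sketch.
    enum VMStatus { pending, starting, running, terminated, replaced }

    // Hypothetical stand-in for the project's agent record.
    static final class Agent {
        final String id;
        final VMStatus status;

        Agent(String id, VMStatus status) {
            this.id = id;
            this.status = status;
        }
    }

    // Agents a bulk Kill/Stop command should still try to reach:
    // everything except agents the watchdog has replaced.
    static List<Agent> commandTargets(List<Agent> agents) {
        List<Agent> targets = new ArrayList<>();
        for (Agent a : agents) {
            if (a.status != VMStatus.replaced) {
                targets.add(a);
            }
        }
        return targets;
    }

    // Builds a "running/total" label, leaving replaced agents out of both numbers.
    static String countLabel(List<Agent> agents) {
        int running = 0;
        int total = 0;
        for (Agent a : agents) {
            if (a.status == VMStatus.replaced) {
                continue; // still stored for debugging, but not counted
            }
            total++;
            if (a.status == VMStatus.running) {
                running++;
            }
        }
        return running + "/" + total;
    }

    // Assumed sample data: 50 running agents plus 4 replaced ones.
    static List<Agent> sampleAgents() {
        List<Agent> agents = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            agents.add(new Agent("agent-" + i, VMStatus.running));
        }
        for (int i = 0; i < 4; i++) {
            agents.add(new Agent("failed-" + i, VMStatus.replaced));
        }
        return agents;
    }
}
```

The replaced agents stay in the list, so they remain available for debugging, but the label and the command target list are computed as if they were absent.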

This PR also fixes a bug where replacement agents were initialized with VMStatus.pending instead of VMStatus.starting, causing the watchdog to wait indefinitely for them to report. Replacement agents now correctly start in starting status and transition normally when they call /v2/agent/ready.
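The initial-status fix can be pictured with the following Java sketch. It rests on assumptions not confirmed by this description: the `ready` status, the `Agent` class, and the `onAgentReady` handler are hypothetical, and the sketch assumes the handler behind /v2/agent/ready only advances agents that are in `starting`, which would be one way an agent created as `pending` could leave the watchdog waiting.

```java
final class ReplacementAgentSketch {
    // pending and starting come from the PR text; `ready` is a placeholder
    // for whatever status follows a successful /v2/agent/ready call.
    enum VMStatus { pending, starting, ready }

    // Hypothetical stand-in for the project's agent record.
    static final class Agent {
        final String id;
        VMStatus status;

        Agent(String id, VMStatus status) {
            this.id = id;
            this.status = status;
        }
    }

    // After the fix: replacement agents begin in `starting`, not `pending`.
    static Agent createReplacement(String id) {
        return new Agent(id, VMStatus.starting);
    }

    // Hypothetical handler logic for /v2/agent/ready.
    // Assumption: only agents in `starting` are advanced; others are left untouched.
    static boolean onAgentReady(Agent agent) {
        if (agent.status != VMStatus.starting) {
            return false;
        }
        agent.status = VMStatus.ready;
        return true;
    }
}
```

Under that assumption, an agent created with `VMStatus.pending` would never leave its initial status when it reports in, while one created through `createReplacement` transitions normally.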

Additionally, public/private IP addresses are now captured and logged during instance creation to aid network connectivity debugging when agents fail to come up.

Please make sure these check boxes are checked before submitting

  • Squashed Commits
  • All Tests Passed - mvn clean test -P default

PR review process

  • Requires one +1 from a reviewer
  • Repository owners will merge your PR once it is approved.

@Zakaria-Kofiro Zakaria-Kofiro changed the title fix: prevent job stuck at Starting when AgentWatchdog replaces failed… fix: prevent job stuck at Starting status when AgentWatchdog replaces failed agents Dec 19, 2025
@Zakaria-Kofiro Zakaria-Kofiro changed the title fix: prevent job stuck at Starting status when AgentWatchdog replaces failed agents FIX: Prevent job stuck at Starting status when AgentWatchdog replaces failed agents Jan 6, 2026