
Build: build process polling healthcheck #11870

@agjohnson


When a build process goes unresponsive, we currently wait a period of time before declaring the build timed out. This protects against builds that run excessively long and are never going to finish, and it also covers scaling group events that make instances suddenly disappear without terminating their builds.

However, when builds do suddenly terminate, we have to wait hours for them to finally be marked as terminated. This is not a great UX, and it also skews our scaling metrics for hours. In the case of a mass instance termination event, this can break our ASG scaling until the builds are terminated.

Instead of relying on a timeout, we could use a healthcheck poll in each build process:

  • An async process in each build task starts up and polls a build healthcheck API once per minute
  • If a build hasn't recorded a single healthcheck in 5 minutes, the build process is likely dead
  • The build is then marked as finished/aborted
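
As a rough illustration of the detection side of the steps above, here is a minimal sketch in Python. The names `find_stale_builds` and `HEALTHCHECK_WINDOW`, and the in-memory dict of timestamps, are assumptions made for the example; a real version would presumably read the last healthcheck time from wherever builds are stored.

```python
from datetime import datetime, timedelta, timezone

# Window from the proposal: builds poll once per minute, and a build with no
# healthcheck for 5 minutes is treated as dead. The constant name is hypothetical.
HEALTHCHECK_WINDOW = timedelta(minutes=5)


def find_stale_builds(last_healthcheck, now, window=HEALTHCHECK_WINDOW):
    """Return the IDs of builds whose last healthcheck is older than ``window``.

    ``last_healthcheck`` maps a build ID to the datetime of its most recent
    healthcheck. This dict is a stand-in for real storage.
    """
    return sorted(
        build_id
        for build_id, seen_at in last_healthcheck.items()
        if now - seen_at > window
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_healthcheck = {
    101: now - timedelta(minutes=1),  # polled recently, still healthy
    102: now - timedelta(minutes=7),  # silent for 7 minutes, likely dead
}
print(find_stale_builds(last_healthcheck, now))  # [102]
```

A periodic task could run a check like this and mark each returned build as finished/aborted.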

There are probably some side effects to consider and plan around:

  • The healthcheck fails to post to the API even though the build itself is still running. This seems unlikely
  • High CPU usage could delay a healthcheck. This is maybe more likely, but a wider window should solve this
  • ??
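
One way to limit the first side effect is to make sure a failed healthcheck post can never fail the build itself. The sketch below is only an illustration under assumptions: `run_healthcheck`, `start_healthcheck`, and the `post` callable are hypothetical names, and `post` stands in for whatever actually calls the healthcheck API. It uses a background thread and swallows errors from the post, so a transient API failure just costs one beat within the wider window.

```python
import threading


def run_healthcheck(post, stop_event, interval=60.0):
    """Call ``post()`` every ``interval`` seconds until ``stop_event`` is set.

    Errors raised by ``post`` are swallowed on purpose so that a failed
    healthcheck never interrupts the build. Returns the number of failed posts.
    """
    failures = 0
    while not stop_event.is_set():
        try:
            post()  # hypothetical callable that hits the healthcheck API
        except Exception:
            failures += 1
        # wait() returns early once the build finishes and sets the event.
        stop_event.wait(interval)
    return failures


def start_healthcheck(post, interval=60.0):
    """Start the healthcheck loop in a daemon thread next to the build."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_healthcheck, args=(post, stop_event, interval), daemon=True
    )
    thread.start()
    return thread, stop_event
```

The build task would call `start_healthcheck(...)` when it begins and `stop_event.set()` when it ends. Because the thread is a daemon, it disappears with the process if the instance is terminated, which is exactly the case the missing healthchecks are meant to reveal.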

This would reduce the timeout window from hours to a few minutes and would help keep ghost builds from affecting our scaling group.

Status: Done