Builds: run a healthcheck command in the background #12332
Inside the Docker container that runs the build process, we execute a simple `curl` command that hits our API at `/build/<id>/healthcheck/` every 2 seconds to tell the backend the build is healthy. The backend runs a periodic task every 5 seconds that checks for builds with no activity in the last 30 seconds and cancels them. Closes #11870
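For illustration, here's a rough Python sketch of the polling loop the build container effectively runs (the real implementation is a plain `curl` command executed in the background; the host, build id, and error handling below are placeholders):

```python
# Hypothetical sketch of the background healthcheck loop; the real
# implementation is a plain ``curl`` command, and the host/build id here
# are placeholders.
import time
import urllib.request

HEALTHCHECK_URL = "http://localhost:8000/build/1234/healthcheck/"  # placeholder
DELAY_SECONDS = 2  # how often the build checks in


def poll_healthcheck() -> None:
    """Ping the healthcheck endpoint forever to signal the build is alive."""
    while True:
        try:
            urllib.request.urlopen(HEALTHCHECK_URL, timeout=5)
        except OSError:
            # A single failed request isn't fatal; the backend only cancels
            # the build after a long period without any successful check-in.
            pass
        time.sleep(DELAY_SECONDS)


if __name__ == "__main__":
    poll_healthcheck()
```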
We should probably revert these changes as well, since this PR should fix the underlying issue: #12338
We differentiate them via a feature flag.
This looks like a good approach, but in the past we've had issues with Celery canceling tasks, so I'm not too confident this will work; it still seems worth trying.
Most of my feedback is about approaches for the future. I think we can merge as is and test it with a small set of projects.
@@ -130,6 +131,7 @@ def create_build_environment(self):
            build=self.data.build,
            environment=self.get_build_env_vars(),
            api_client=self.data.api_client,
            build_api_key=self.data.build_api_key,
How was this getting passed into the build before? Seems like we might want to keep using it the same way.
This value is passed as a task argument at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L1027 and then used at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L384
Since it's not saved anywhere, we can't access it from some inner methods/functions. That's why I had to save this value inside the `self.data` object we use to share data across the whole build process we control.
In summary, this value was not used anywhere other than where it was received, which is why we weren't saving it inside the `self.data` object.
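Roughly, the idea reduces to something like this simplified stand-in (not the actual task class; names are approximate):

```python
# Simplified stand-in showing why the key is stored on ``self.data``:
# once saved there, inner helpers can read it without the value being
# threaded through every method signature.
from types import SimpleNamespace


class BuildTaskSketch:
    def __init__(self):
        # ``data`` plays the role of the object shared across the build process.
        self.data = SimpleNamespace()

    def before_start(self, build_api_key: str) -> None:
        # Previously the key was only used where it was received as a task
        # argument; saving it makes it reachable from inner methods too.
        self.data.build_api_key = build_api_key

    def create_build_environment(self) -> dict:
        # Any inner method can now read the key from the shared data object.
        return {"build_api_key": self.data.build_api_key}


task = BuildTaskSketch()
task.before_start("example-api-key")
print(task.create_build_environment())
```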
# (there are 3 containers involved locally here: web, build, and user's build)
#
# This shouldn't happen in production, because we are not doing Docker in Docker.
url = f"http://readthedocs.ngrok.io{healthcheck_url}"
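For context, a minimal sketch of the kind of environment-conditional URL selection this hardcoding implies (the flag and both hostnames are illustrative placeholders, not the real settings or production hosts):

```python
# Hypothetical sketch: choose the healthcheck base URL depending on where we
# run. ``RUNNING_LOCALLY`` and both hostnames are placeholders.
RUNNING_LOCALLY = True  # would come from Django settings in the real code

PUBLIC_TUNNEL = "http://readthedocs.ngrok.io"   # externally reachable tunnel
API_HOST = "https://readthedocs.example.com"    # placeholder for the real API host


def build_healthcheck_url(healthcheck_path: str) -> str:
    if RUNNING_LOCALLY:
        # Locally, the user's build container can't reach the web container
        # directly (Docker in Docker), so an external tunnel is used instead.
        return f"{PUBLIC_TUNNEL}{healthcheck_path}"
    # In production there is no Docker in Docker, so the API is reachable directly.
    return f"{API_HOST}{healthcheck_path}"


print(build_healthcheck_url("/build/1234/healthcheck/"))
```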
I would really like to find a way to make this work locally without configuration, otherwise we will rarely test that it still works.
This should eventually be the default mode too, so we'll definitely need to figure out a better solution. But it doesn't need to be solved just yet; I feel we could even omit this local-development hardcoding from our code for now and return to it later.
Me too, but I wasn't able to find a way to do that. Let me know if you have ideas about what I may be doing wrong here and I can test them.
@@ -118,6 +166,7 @@ def finish_inactive_builds():
        ~Q(state__in=BUILD_FINAL_STATES)
        & Q(date__lt=timezone.now() - delta)
        & Q(date__gt=timezone.now() - datetime.timedelta(days=1))
        & ~Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK)
Do we think this is part of what's causing our build retrying issue? Otherwise, I think we still want this to apply.
I did not see signs that this bug was caused by the task time limit. The build healthcheck polling task should update the build state if the check fails, so this task shouldn't run on these builds anyway.
If there is a problem with that logic, we should probably use this task as a backup method for ending those builds gracefully.
No, this task should be unrelated to the retrying issue we are suffering.
Here, I'm just excluding the builds that use the feature flag from the old task, since they are now managed by the new `finish_unhealthy_builds` task.
My idea is to delete this old task once we are fully on the healthcheck polling.
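To make the split concrete, here's a rough sketch of what `finish_unhealthy_builds` could look like next to the old query above (import paths, the `healthcheck` timestamp field, and the `cancel_build` helper are assumptions for illustration, not the actual readthedocs.org code):

```python
# Rough sketch of the new periodic task that complements the old
# ``finish_inactive_builds`` query. Import paths, the ``healthcheck``
# timestamp field, and ``cancel_build`` are assumptions for illustration.
import datetime

from django.db.models import Q
from django.utils import timezone

from readthedocs.builds.constants import BUILD_FINAL_STATES  # assumed path
from readthedocs.builds.models import Build                  # assumed path
from readthedocs.projects.models import Feature              # assumed path


def cancel_build(build):
    """Placeholder for the real cancellation helper (revokes the build's task)."""
    ...


def finish_unhealthy_builds():
    timeout = datetime.timedelta(seconds=30)  # e.g. RTD_BUILD_HEALTHCHECK_TIMEOUT
    queryset = Build.objects.filter(
        # Only builds opted into the healthcheck feature flag...
        Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK)
        # ...that haven't finished yet...
        & ~Q(state__in=BUILD_FINAL_STATES)
        # ...and haven't checked in recently (assumed ``healthcheck`` field).
        & Q(healthcheck__lt=timezone.now() - timeout)
    )
    for build in queryset.iterator():
        cancel_build(build)  # re-uses the normal cancellation path
```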
I don't think this is an issue, honestly. The "Cancel" button on the build details page uses the same call and it has been working fine for a long time. The same pattern is also used when two builds for the same version are triggered: the first one is canceled with the same technique. In fact, I would call this behavior reliable at this point.
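For reference, the cancellation technique being referred to is essentially Celery's task revocation; a simplified sketch follows (the app instance and how the task id is obtained are placeholders):

```python
# Simplified sketch of the cancellation technique: revoke the Celery task that
# is running the build. The app instance and task id lookup are placeholders.
from celery import Celery

app = Celery("readthedocs")  # placeholder; the real app is configured elsewhere


def cancel_running_build(task_id: str) -> None:
    # ``terminate=True`` delivers the signal to the worker process executing
    # the task; SIGINT gives the build task a chance to clean up and update
    # the build state before exiting.
    app.control.revoke(task_id, signal="SIGINT", terminate=True)
```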
Thanks a lot for the feedback. I think the main missing piece is making this work without ngrok when running locally. We can research that later. Let's try this in production first with those projects we know are currently failing and see if we can make them build 👍🏼
Inside the Docker container that runs the build process, we execute a simple `curl` command that hits our API at `/build/<id>/healthcheck/` every `RTD_BUILD_HEALTHCHECK_DELAY` seconds to tell the backend the build is healthy.
The backend runs a periodic task every minute that checks for builds with no activity in the last `RTD_BUILD_HEALTHCHECK_TIMEOUT` seconds and cancels them.
Closes #11870
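A minimal sketch of how these two settings might be wired up from the environment (the defaults and settings module layout are assumptions):

```python
# Hypothetical sketch of how the two settings might be read from the
# environment; defaults and the settings module layout are assumptions.
import os

# How often the build container pings ``/build/<id>/healthcheck/``.
RTD_BUILD_HEALTHCHECK_DELAY = int(os.environ.get("RTD_BUILD_HEALTHCHECK_DELAY", 2))

# How long a build may stay silent before the periodic task cancels it.
RTD_BUILD_HEALTHCHECK_TIMEOUT = int(os.environ.get("RTD_BUILD_HEALTHCHECK_TIMEOUT", 30))
```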