Skip to content

Builds: run a healthcheck command in the background #12332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jul 22, 2025

Conversation

humitos
Copy link
Member

@humitos humitos commented Jul 17, 2025

Inside the Docker container that runs all the build process, we execute a simple curl command to hit our API at /build/<id>/healthcheck/ every RTD_BUILD_HEALTHCHECK_DELAY seconds to communicate the backend the build is healthy.

The backend runs a periodic task every 1 minute and check for those builds that haven't had activity in the last RTD_BUILD_HEALTHCHECK_TIMEOUT seconds and cancel them.

Closes #11870

humitos added 5 commits July 17, 2025 14:55
Inside the Docker container that runs all the build process, we execute a simple
`curl` command to hit our API at `/build/<id>/healthcheck/` every 2 seconds to
communicate the backend the build is healthy.

The backend runs a periodic task every 5 seconds and check for those builds that
haven't had activity in the last 30 seconds and cancel those builds.

Closes #11870
@humitos
Copy link
Member Author

humitos commented Jul 18, 2025

We should probably revert these changes as well, since this PR should fix the underlying issue: #12338

@humitos humitos marked this pull request as ready for review July 21, 2025 16:33
@humitos humitos requested a review from a team as a code owner July 21, 2025 16:33
@humitos humitos requested a review from ericholscher July 21, 2025 16:33
@humitos humitos requested a review from agjohnson July 21, 2025 18:50
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good approach, but in the past, we've had issues with celery canceling tasks, so I'm not too confident that this will work, but it seems worth trying.

Most of my feedback is around approaches for the future, I think we can merge as is and test it with a small set of projects.

@@ -130,6 +131,7 @@ def create_build_environment(self):
build=self.data.build,
environment=self.get_build_env_vars(),
api_client=self.data.api_client,
build_api_key=self.data.build_api_key,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was this getting passed into the build before? Seems like we might want to keep using it the same way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is passed as a task argument at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L1027 and then it's used at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L384

Since it's not saved anywhere, we are not able to access it from some inner methods/functions. That's why I had to save this value inside the self.data object we use to share all across the build process we control.

In summary, this value was not used anywhere else than where it was received and that's why we weren't saving it inside the self.data object.

# (there are 3 containers involved locally here: web, build, and user's build)
#
# This shouldn't happen in production, because we are not doing Docker in Docker.
url = f"http://readthedocs.ngrok.io{healthcheck_url}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really like to find a way to make this work locally without configuration, otherwise we will rarely test that it still works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also eventually be the default mode too, so we'll definitely need to find out a better solution. But this doesn't need to be figured out just yet, I feel we could even omit this local development hardcoding from our code for now and return to this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too, but I wasn't able to find that way. Let me know if you have some ideas on what I may be doing wrong here and I can test them.

@@ -118,6 +166,7 @@ def finish_inactive_builds():
~Q(state__in=BUILD_FINAL_STATES)
& Q(date__lt=timezone.now() - delta)
& Q(date__gt=timezone.now() - datetime.timedelta(days=1))
& ~Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think this is part of what's causing our build retrying issue? Otherwise, I think we still want this to apply.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see signs that this bug was caused by the task time limit. The task for build healthcheck polling should update the build state if the check fails, so this task shouldn't run on these builds anyways.

If there is a problem with that logic, we should probably use this task as a backup method to ending those builds gracefully.

Copy link
Member Author

@humitos humitos Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this task should be unrelated to the retrying issue we are suffering.

Here, I'm just removing the builds that using the feature flag to be skipped in the old task since they are managed by the new finish_unhealthy_builds now.

My idea is to delete this old task once we are fully on the healthcheck polling.

# (there are 3 containers involved locally here: web, build, and user's build)
#
# This shouldn't happen in production, because we are not doing Docker in Docker.
url = f"http://readthedocs.ngrok.io{healthcheck_url}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also eventually be the default mode too, so we'll definitely need to find out a better solution. But this doesn't need to be figured out just yet, I feel we could even omit this local development hardcoding from our code for now and return to this later.

@@ -118,6 +166,7 @@ def finish_inactive_builds():
~Q(state__in=BUILD_FINAL_STATES)
& Q(date__lt=timezone.now() - delta)
& Q(date__gt=timezone.now() - datetime.timedelta(days=1))
& ~Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see signs that this bug was caused by the task time limit. The task for build healthcheck polling should update the build state if the check fails, so this task shouldn't run on these builds anyways.

If there is a problem with that logic, we should probably use this task as a backup method to ending those builds gracefully.

@humitos
Copy link
Member Author

humitos commented Jul 22, 2025

we've had issues with celery canceling tasks, so I'm not too confident that this will work, but it seems worth trying.

I don't think this is an issue, honestly. The "Cancel" button in the build details page uses the same call and it has been working fine for a long time. Also, the same pattern is used when 2 builds for the same version are triggered, the first one is canceled with the same technique. In fact, I would call this behavior reliably at this point.

@humitos
Copy link
Member Author

humitos commented Jul 22, 2025

Thanks a lot for the feedback. I think the main missing thing is to make this work without NGROK when running locally. We can research about that later. Let's try this in production first with those projects we know they are failing currently and see if we can make them build 👍🏼

@humitos humitos enabled auto-merge (squash) July 22, 2025 07:54
@humitos humitos merged commit ab1d7c7 into main Jul 22, 2025
4 checks passed
@humitos humitos deleted the humitos/build-healthcheck branch July 22, 2025 08:07
humitos added a commit that referenced this pull request Jul 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build: build process polling healthcheck
3 participants