-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Builds: run a healthcheck command in the background #12332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 11 commits
a828d35
4e09d76
cf411bc
ca5332c
176e187
b77e7cf
77ef6c4
181cf8c
35322d1
a38475e
7a2fcb9
d2f656d
9b1eb5b
e6e1022
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Generated by Django 5.2.4 on 2025-07-17 11:39 | ||
|
||
from django.db import migrations | ||
from django.db import models | ||
from django_safemigrate import Safe | ||
|
||
|
||
class Migration(migrations.Migration): | ||
safe = Safe.before_deploy() | ||
|
||
dependencies = [ | ||
("builds", "0063_alter_buildcommandresult"), | ||
] | ||
|
||
operations = [ | ||
migrations.AddField( | ||
model_name="build", | ||
name="healthcheck", | ||
field=models.DateTimeField(blank=True, null=True, verbose_name="Healthcheck"), | ||
), | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ | |
|
||
import structlog | ||
from django.conf import settings | ||
from django.urls import reverse | ||
from django.utils.translation import gettext_lazy as _ | ||
from docker import APIClient | ||
from docker.errors import APIError as DockerAPIError | ||
|
@@ -19,6 +20,7 @@ | |
|
||
from readthedocs.builds.models import BuildCommandResultMixin | ||
from readthedocs.core.utils import slugify | ||
from readthedocs.projects.models import Feature | ||
|
||
from .constants import DOCKER_HOSTNAME_MAX_LEN | ||
from .constants import DOCKER_IMAGE | ||
|
@@ -573,6 +575,7 @@ class DockerBuildEnvironment(BaseBuildEnvironment): | |
container_time_limit = DOCKER_LIMITS.get("time") | ||
|
||
def __init__(self, *args, **kwargs): | ||
self.build_api_key = kwargs.pop("build_api_key", None) | ||
container_image = kwargs.pop("container_image", None) | ||
super().__init__(*args, **kwargs) | ||
self.client = None | ||
|
@@ -829,7 +832,45 @@ def create_container(self): | |
runtime="runsc", # gVisor runtime | ||
) | ||
client.start(container=self.container_id) | ||
|
||
if self.project.has_feature(Feature.BUILD_HEALTHCHECK): | ||
self._run_background_healthcheck() | ||
|
||
except (DockerAPIError, ConnectionError) as exc: | ||
raise BuildAppError( | ||
BuildAppError.GENERIC_WITH_BUILD_ID, exception_messag=exc.explanation | ||
) from exc | ||
|
||
def _run_background_healthcheck(self): | ||
log.debug("Running build with healthcheck.") | ||
|
||
# Run a healthcheck that will ping our API constantly. | ||
# We run a Celery task to check the running build has a valid healthcheck timestamp, | ||
# if that's not the case, these builds are terminated and marked as failed. | ||
build_id = self.build.get("id") | ||
|
||
healthcheck_url = reverse("build-healthcheck", kwargs={"pk": build_id}) | ||
if settings.RTD_DOCKER_COMPOSE and "ngrok" in settings.PRODUCTION_DOMAIN: | ||
# NOTE: we do require using NGROK here to go over internet because I | ||
# didn't find a way to access the `web` container from inside the | ||
# container the `build` container created for this particular build | ||
# (there are 3 containers involved locally here: web, build, and user's build) | ||
# | ||
# This shouldn't happen in production, because we are not doing Docker in Docker. | ||
url = f"http://readthedocs.ngrok.io{healthcheck_url}" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would really like to find a way to make this work locally without configuration, otherwise we will rarely test that it still works. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should also eventually be the default mode too, so we'll definitely need to find out a better solution. But this doesn't need to be figured out just yet, I feel we could even omit this local development hardcoding from our code for now and return to this later. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Me too, but I wasn't able to find that way. Let me know if you have some ideas on what I may be doing wrong here and I can test them. |
||
else: | ||
url = f"{settings.SLUMBER_API_HOST}{healthcheck_url}" | ||
|
||
cmd = f"/bin/bash -c 'while true; do curl --max-time 2 -H \"Authorization: Token {self.build_api_key}\" -X POST {url}; sleep {settings.RTD_BUILD_HEALTHCHECK_DELAY}; done;'" | ||
log.debug("Healthcheck command to run.", command=cmd) | ||
|
||
client = self.get_client() | ||
exec_cmd = client.exec_create( | ||
humitos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
container=self.container_id, | ||
cmd=cmd, | ||
user=settings.RTD_DOCKER_USER, | ||
stdout=True, | ||
stderr=True, | ||
) | ||
# `detach=True` allows us to run this command in the background | ||
client.exec_start(exec_id=exec_cmd["Id"], stream=False, detach=True) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,7 @@ | |
from readthedocs.core.utils.filesystem import safe_rmtree | ||
from readthedocs.doc_builder.exceptions import BuildAppError | ||
from readthedocs.notifications.models import Notification | ||
from readthedocs.projects.models import Feature | ||
from readthedocs.storage import build_media_storage | ||
from readthedocs.worker import app | ||
|
||
|
@@ -95,6 +96,53 @@ def clean_project_resources(project, version=None, version_slug=None): | |
project.imported_files.all().delete() | ||
|
||
|
||
@app.task() | ||
def finish_unhealthy_builds(): | ||
""" | ||
Finish inactive builds. | ||
|
||
A build is consider inactive if the last healthcheck reported was more than | ||
30 seconds ago. | ||
humitos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
These inactive builds will be marked as ``success=False`` and | ||
``state=CANCELLED`` with an ``error`` to be communicated to the user. | ||
""" | ||
log.debug("Running task to finish inactive builds (no healtcheck received).") | ||
delta = datetime.timedelta(seconds=settings.RTD_BUILD_HEALTHCHECK_TIMEOUT) | ||
query = ( | ||
~Q(state__in=BUILD_FINAL_STATES) | ||
& Q(healthcheck__lt=timezone.now() - delta) | ||
& Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK) | ||
) | ||
|
||
projects_finished = set() | ||
builds_finished = [] | ||
builds = Build.objects.filter(query)[:50] | ||
for build in builds: | ||
build.success = False | ||
build.state = BUILD_STATE_CANCELLED | ||
build.save() | ||
|
||
# Tell Celery to cancel this task in case it's in a zombie state. | ||
app.control.revoke(build.task_id, signal="SIGINT", terminate=True) | ||
humitos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Notification.objects.add( | ||
message_id=BuildAppError.BUILD_TERMINATED_DUE_INACTIVITY, | ||
attached_to=build, | ||
) | ||
|
||
builds_finished.append(build.pk) | ||
projects_finished.add(build.project.slug) | ||
|
||
if builds_finished: | ||
log.info( | ||
'Builds marked as "Terminated due inactivity" (not healthcheck received).', | ||
count=len(builds_finished), | ||
project_slugs=projects_finished, | ||
build_pks=builds_finished, | ||
) | ||
|
||
|
||
@app.task() | ||
def finish_inactive_builds(): | ||
""" | ||
|
@@ -118,6 +166,7 @@ def finish_inactive_builds(): | |
~Q(state__in=BUILD_FINAL_STATES) | ||
& Q(date__lt=timezone.now() - delta) | ||
& Q(date__gt=timezone.now() - datetime.timedelta(days=1)) | ||
& ~Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we think this is part of what's causing our build retrying issue? Otherwise, I think we still want this to apply. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did not see signs that this bug was caused by the task time limit. The task for build healthcheck polling should update the build state if the check fails, so this task shouldn't run on these builds anyways. If there is a problem with that logic, we should probably use this task as a backup method to ending those builds gracefully. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, this task should be unrelated to the retrying issue we are suffering. Here, I'm just removing the builds that using the feature flag to be skipped in the old task since they are managed by the new My idea is to delete this old task once we are fully on the healthcheck polling. |
||
) | ||
|
||
projects_finished = set() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How was this getting passed into the build before? Seems like we might want to keep using it the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is passed as a task argument at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L1027 and then it's used at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L384
Since it's not saved anywhere, we are not able to access it from some inner methods/functions. That's why I had to save this value inside the
self.data
object we use to share all across the build process we control.In summary, this value was not used anywhere else than where it was received and that's why we weren't saving it inside the
self.data
object.