Builds: run a healthcheck command in the background #12332
Inside the Docker container that runs the build process, we execute a simple `curl` command that hits our API at `/build/<id>/healthcheck/` every 2 seconds to tell the backend the build is healthy. The backend runs a periodic task every 5 seconds that checks for builds with no activity in the last 30 seconds and cancels them. Closes #11870
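For illustration, here's a rough Python sketch of the polling loop the build container effectively runs (the real implementation is a plain `curl` command executed in the background; the host, build id, and error handling below are placeholders):

```python
# Hypothetical sketch of the background healthcheck loop; the real
# implementation is a plain ``curl`` command, and the host/build id here
# are placeholders.
import time
import urllib.request

HEALTHCHECK_URL = "http://localhost:8000/build/1234/healthcheck/"  # placeholder
DELAY_SECONDS = 2  # how often the build checks in


def poll_healthcheck() -> None:
    """Ping the healthcheck endpoint forever to signal the build is alive."""
    while True:
        try:
            urllib.request.urlopen(HEALTHCHECK_URL, timeout=5)
        except OSError:
            # A single failed request isn't fatal; the backend only cancels
            # the build after a long period without any successful check-in.
            pass
        time.sleep(DELAY_SECONDS)


if __name__ == "__main__":
    poll_healthcheck()
```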
We should probably revert these changes as well, since this PR should fix the underlying issue: #12338
We differentiate them via a feature flag.
This looks like a good approach, but in the past we've had issues with Celery canceling tasks, so I'm not too confident this will work; it still seems worth trying.
Most of my feedback is about approaches for the future. I think we can merge as is and test it with a small set of projects.
@@ -130,6 +131,7 @@ def create_build_environment(self):
            build=self.data.build,
            environment=self.get_build_env_vars(),
            api_client=self.data.api_client,
            build_api_key=self.data.build_api_key,
How was this getting passed into the build before? Seems like we might want to keep using it the same way.
This value is passed as a task argument at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L1027 and then used at https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/projects/tasks/builds.py#L384
Since it's not saved anywhere, we can't access it from some inner methods/functions. That's why I had to save this value inside the `self.data` object we use to share data across the whole build process we control.
In summary, this value was not used anywhere other than where it was received, which is why we weren't saving it inside the `self.data` object.
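Roughly, the idea reduces to something like this simplified stand-in (not the actual task class; names are approximate):

```python
# Simplified stand-in showing why the key is stored on ``self.data``:
# once saved there, inner helpers can read it without the value being
# threaded through every method signature.
from types import SimpleNamespace


class BuildTaskSketch:
    def __init__(self):
        # ``data`` plays the role of the object shared across the build process.
        self.data = SimpleNamespace()

    def before_start(self, build_api_key: str) -> None:
        # Previously the key was only used where it was received as a task
        # argument; saving it makes it reachable from inner methods too.
        self.data.build_api_key = build_api_key

    def create_build_environment(self) -> dict:
        # Any inner method can now read the key from the shared data object.
        return {"build_api_key": self.data.build_api_key}


task = BuildTaskSketch()
task.before_start("example-api-key")
print(task.create_build_environment())
```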
# (there are 3 containers involved locally here: web, build, and user's build)
#
# This shouldn't happen in production, because we are not doing Docker in Docker.
url = f"http://readthedocs.ngrok.io{healthcheck_url}"
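For context, a minimal sketch of the kind of environment-conditional URL selection this hardcoding implies (the flag and both hostnames are illustrative placeholders, not the real settings or production hosts):

```python
# Hypothetical sketch: choose the healthcheck base URL depending on where we
# run. ``RUNNING_LOCALLY`` and both hostnames are placeholders.
RUNNING_LOCALLY = True  # would come from Django settings in the real code

PUBLIC_TUNNEL = "http://readthedocs.ngrok.io"   # externally reachable tunnel
API_HOST = "https://readthedocs.example.com"    # placeholder for the real API host


def build_healthcheck_url(healthcheck_path: str) -> str:
    if RUNNING_LOCALLY:
        # Locally, the user's build container can't reach the web container
        # directly (Docker in Docker), so an external tunnel is used instead.
        return f"{PUBLIC_TUNNEL}{healthcheck_path}"
    # In production there is no Docker in Docker, so the API is reachable directly.
    return f"{API_HOST}{healthcheck_path}"


print(build_healthcheck_url("/build/1234/healthcheck/"))
```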
I would really like to find a way to make this work locally without configuration, otherwise we will rarely test that it still works.
This should eventually be the default mode too, so we'll definitely need to figure out a better solution. But it doesn't need to be solved just yet; I feel we could even omit this local-development hardcoding from our code for now and return to it later.
Me too, but I wasn't able to find a way to do that. Let me know if you have ideas about what I may be doing wrong here and I can test them.
@@ -118,6 +166,7 @@ def finish_inactive_builds():
        ~Q(state__in=BUILD_FINAL_STATES)
        & Q(date__lt=timezone.now() - delta)
        & Q(date__gt=timezone.now() - datetime.timedelta(days=1))
        & ~Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK)
Do we think this is part of what's causing our build retrying issue? Otherwise, I think we still want this to apply.
I did not see signs that this bug was caused by the task time limit. The build healthcheck polling task should update the build state if the check fails, so this task shouldn't run on these builds anyway.
If there is a problem with that logic, we should probably use this task as a backup method for ending those builds gracefully.
No, this task should be unrelated to the retrying issue we are suffering.
Here, I'm just excluding the builds that use the feature flag from the old task, since they are now managed by the new `finish_unhealthy_builds` task.
My idea is to delete this old task once we are fully on the healthcheck polling.
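To make the split concrete, here's a rough sketch of what `finish_unhealthy_builds` could look like next to the old query above (import paths, the `healthcheck` timestamp field, and the `cancel_build` helper are assumptions for illustration, not the actual readthedocs.org code):

```python
# Rough sketch of the new periodic task that complements the old
# ``finish_inactive_builds`` query. Import paths, the ``healthcheck``
# timestamp field, and ``cancel_build`` are assumptions for illustration.
import datetime

from django.db.models import Q
from django.utils import timezone

from readthedocs.builds.constants import BUILD_FINAL_STATES  # assumed path
from readthedocs.builds.models import Build                  # assumed path
from readthedocs.projects.models import Feature              # assumed path


def cancel_build(build):
    """Placeholder for the real cancellation helper (revokes the build's task)."""
    ...


def finish_unhealthy_builds():
    timeout = datetime.timedelta(seconds=30)  # e.g. RTD_BUILD_HEALTHCHECK_TIMEOUT
    queryset = Build.objects.filter(
        # Only builds opted into the healthcheck feature flag...
        Q(project__feature__feature_id=Feature.BUILD_HEALTHCHECK)
        # ...that haven't finished yet...
        & ~Q(state__in=BUILD_FINAL_STATES)
        # ...and haven't checked in recently (assumed ``healthcheck`` field).
        & Q(healthcheck__lt=timezone.now() - timeout)
    )
    for build in queryset.iterator():
        cancel_build(build)  # re-uses the normal cancellation path
```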
I don't think this is an issue, honestly. The "Cancel" button on the build details page uses the same call and it has been working fine for a long time. The same pattern is also used when two builds for the same version are triggered: the first one is canceled with the same technique. In fact, I would call this behavior reliable at this point.
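For reference, the cancellation technique being referred to is essentially Celery's task revocation; a simplified sketch follows (the app instance and how the task id is obtained are placeholders):

```python
# Simplified sketch of the cancellation technique: revoke the Celery task that
# is running the build. The app instance and task id lookup are placeholders.
from celery import Celery

app = Celery("readthedocs")  # placeholder; the real app is configured elsewhere


def cancel_running_build(task_id: str) -> None:
    # ``terminate=True`` delivers the signal to the worker process executing
    # the task; SIGINT gives the build task a chance to clean up and update
    # the build state before exiting.
    app.control.revoke(task_id, signal="SIGINT", terminate=True)
```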
Thanks a lot for the feedback. I think the main missing piece is making this work without ngrok when running locally. We can research that later. Let's try this in production first with those projects we know are currently failing and see if we can make them build 👍🏼
Inside the Docker container that runs the build process, we execute a simple `curl` command that hits our API at `/build/<id>/healthcheck/` every `RTD_BUILD_HEALTHCHECK_DELAY` seconds to tell the backend the build is healthy.
The backend runs a periodic task every minute that checks for builds with no activity in the last `RTD_BUILD_HEALTHCHECK_TIMEOUT` seconds and cancels them.
Closes #11870
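A minimal sketch of how these two settings might be wired up from the environment (the defaults and settings module layout are assumptions):

```python
# Hypothetical sketch of how the two settings might be read from the
# environment; defaults and the settings module layout are assumptions.
import os

# How often the build container pings ``/build/<id>/healthcheck/``.
RTD_BUILD_HEALTHCHECK_DELAY = int(os.environ.get("RTD_BUILD_HEALTHCHECK_DELAY", 2))

# How long a build may stay silent before the periodic task cancels it.
RTD_BUILD_HEALTHCHECK_TIMEOUT = int(os.environ.get("RTD_BUILD_HEALTHCHECK_TIMEOUT", 30))
```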