fix(sentry apps): Self-heal/regenerate broken ServiceHooks #96712

Open · wants to merge 5 commits into master
Conversation

Christinarlong (Contributor)

Finally got around to doing this... In the past, some Sentry Apps were updated but their ServiceHooks (read: webhooks) weren't updated properly for various reasons, leaving people with broken webhooks. That also caused these Sentry errors to surface to us:

  1. https://sentry.sentry.io/issues/6682915846/?project=1&query=is%3Aunresolved%20missing_servicehook&referrer=issue-stream
  2. https://sentry.sentry.io/issues/6678511610/?project=1&query=operation_type%3Asend_webhook%20event_type%3Aissue.unresolved&referrer=issue-stream

This PR adds a task that gradually regenerates the broken webhooks by re-running the create_or_update_service_hooks_for_installation function. That function should have run during the initial webhook update but, for various reasons (timeouts, errors, etc.), didn't complete.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jul 29, 2025
Comment on lines 810 to 812
def regenerate_service_hook_for_installation(
installation_id: int, servicehook_events: list[str] | None
) -> None:
Christinarlong (Contributor, Author)

This function would be used whenever we encounter the SentryAppWebhookFailureReason.MISSING_SERVICEHOOK or SentryAppWebhookFailureReason.EVENT_NOT_IN_SERVCEHOOK failure cases.

In both cases we should normally have an installation with a working ServiceHook, since we use the Sentry app's event list as the source of truth for all installations.

This logic was put into a separate task because I didn't want a logically distinct operation to potentially pollute the results/info of another task. It also lets us gather better data on when this operation is run.


codecov bot commented Jul 29, 2025

Codecov Report

❌ Patch coverage is 37.50000% with 10 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines                       Patch %   Lines
src/sentry/sentry_apps/tasks/sentry_apps.py    33.33%    10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #96712      +/-   ##
==========================================
- Coverage   80.63%   80.15%   -0.49%     
==========================================
  Files        8475     8483       +8     
  Lines      373193   379104    +5911     
  Branches    24205    24205              
==========================================
+ Hits       300929   303857    +2928     
- Misses      71891    74874    +2983     
  Partials      373      373              

@Christinarlong Christinarlong marked this pull request as ready for review July 30, 2025 00:06
@Christinarlong Christinarlong requested review from a team as code owners July 30, 2025 00:06
cursor[bot]

This comment was marked as outdated.

assert installation is not None, "Installation must exist to regenerate service hooks"
app_events = installation.sentry_app.events

if servicehook_events is None or set(servicehook_events) != set(app_events):
Christinarlong (Contributor, Author)

Thinking about this more, I'm not sure we should even have these checks; I think we can assume the caller has already validated that the ServiceHook is in a bad state.

Member

i agree with your intuition: i think this task should only be called if this if statement is true.

@iamrajjoshi (Member) left a comment

can we add a metric to the task so we can monitor how often it is triggered, and create a redash dashboard/query to show us the burndown rate of the broken servicehooks?

@@ -799,6 +799,38 @@ def send_webhooks(installation: RpcSentryAppInstallation, event: str, **kwargs:
)


@instrumented_task(
"sentry.sentry_apps.tasks.sentry_apps.regenerate_service_hook_for_installation",
taskworker_config=TaskworkerConfig(
Member

just making sure we don't have to add anything extra from TASK_OPTIONS in the taskworker config

Christinarlong (Contributor, Author)

Er, I'm not sure if we need to add anything else. Oh, I did bork the namespace though; I highkey just copied this from the other tasks.

installation_id: int, servicehook_events: list[str] | None
) -> None:
installation = app_service.installation_by_id(id=installation_id)
assert installation is not None, "Installation must exist to regenerate service hooks"
Member

nit: maybe we raise some integrity exception and add more context, i.e. installation_id, etc.
Another idea is to add a logger.info with this context (it becomes a breadcrumb in the Sentry error).
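A minimal sketch of that suggestion, replacing the bare assert with a logged, context-rich exception. The exception class, function name, and log key are all hypothetical, and the lookup is injected so the sketch stays self-contained:

```python
import logging

logger = logging.getLogger("sentry.sentry_apps.tasks")


class MissingInstallationError(Exception):
    """Hypothetical exception raised instead of a bare assert."""


def get_installation_or_raise(installation_id: int, lookup) -> object:
    # lookup stands in for app_service.installation_by_id.
    installation = lookup(installation_id)
    if installation is None:
        # logger.info leaves a breadcrumb carrying the id, so the resulting
        # Sentry error has the context the reviewer asks for.
        logger.info(
            "regenerate_service_hook.missing_installation",
            extra={"installation_id": installation_id},
        )
        raise MissingInstallationError(f"No installation found for id={installation_id}")
    return installation
```

Unlike an assert, this survives python -O, and the installation_id travels with both the breadcrumb and the exception message.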

@Christinarlong (Contributor, Author)

> can we add a metric to the task so we can monitor how often it is triggered, and create a redash dashboard/query to show us the burndown rate of the broken servicehooks?

I think we can piggyback off the general task metric to see how often the task is triggered.

As for a stat/visualization of the burndown:

  1. A counter that shows the number of installations whose ServiceHook events don't match those of their Sentry app counterpart - https://redash.getsentry.net/queries/9297/source#12084

Also, I was thinking we could track this via the open Sentry issues in the PR description, since if those go down we can correlate that with burning down the broken ServiceHooks.
