Skip to content

Conversation

@Peyton232
Copy link
Contributor

MINT-3448

Description

This change involves add retry logic to only error on the final retry of a faktory job and instead throw warnings up until the final retry

How to test this change

  1. Run a bunch of faktory jobs
  2. I added a line to TranslateAuditBatchJob that would cause it to fail 85% of the time and then just checked the app logs after running dbseed
  3. something like this
// Add this line to fail 85% of the time
	if rand.Float64() < 0.85 {
		return errors.New("simulated failure for testing")
	}

This does take a little bit since this is on cron job and also Faktory has exponential backoff on jobs

PR Author Checklist

  • I have provided a detailed description of the changes in this PR.
  • I have provided clear instructions on how to test the changes in this PR.
  • I have updated tests or written new tests as appropriate in this PR.
  • Updated the BRUNO Collection if necessary.

PR Reviewer Guidelines

  • It's best to pull the branch locally and test it, rather than just looking at the code online!
  • When approving a PR, provide a reason why you're approving it
    • e.g. "Approving because I tested it locally and all functionality works as expected"
    • e.g. "Approving because the change is simple and matches the Figma design"
  • Don't be afraid to leave comments or ask questions, especially if you don't understand why something was done! (This is often a great time to suggest code comments or documentation updates)
  • Check that all code is adequately covered by tests - if it isn't feel free to suggest the addition of tests.

@Peyton232 Peyton232 requested a review from a team as a code owner October 23, 2025 14:14
@Peyton232 Peyton232 requested review from StevenWadeOddball, Copilot and patrickseguraoddball and removed request for a team October 23, 2025 14:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds retry-aware logging middleware to Faktory job processing, ensuring that job failures only log as errors on the final retry attempt, while earlier failures log as warnings.

Key changes:

  • Implemented RetryAwareLogging middleware that distinguishes between retriable failures (warnings) and final failures (errors)
  • Added comprehensive test coverage for the retry-aware logging behavior across different retry scenarios
  • Integrated the middleware into the worker manager and added a minor logging improvement to the digest cron job

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
pkg/worker/worker.go Added default max retries constant and registered the retry-aware logging middleware
pkg/worker/job_utilities.go Implemented the RetryAwareLogging middleware function with retry count logic
pkg/worker/job_utilities_test.go Added comprehensive tests for retry-aware logging behavior including edge cases
pkg/worker/digest_cron_job.go Minor refactoring to add day parameter to logging for better observability

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@StevenWadeOddball StevenWadeOddball self-assigned this Oct 23, 2025
@StevenWadeOddball
Copy link
Collaborator

Thanks Peyton, still looking, but I'm running into some issues testing I wanted to alert you to while I continue review. Something is panicking, it looks like it's from the new logic.

2025-10-23 08:27:00 2025-10-23T14:27:00.756Z    WARN    worker/job_utilities.go:115     job failed; will retry  {"jid": "i0VcBI12N-Ab2WQc", "job_type": "ModelStatusUpdateJob", "fail_count_so_far": 0, "max_retries": 2, "error": "recovered from panic. Error: runtime error: invalid memory address or nil pointer dereference"}
2025-10-23 08:27:00 github.com/cms-enterprise/mint-app/pkg/worker.RetryAwareLogging.func1
2025-10-23 08:27:00     /mint/pkg/worker/job_utilities.go:115
2025-10-23 08:27:00 github.com/contribsys/faktory_worker_go.dispatch
2025-10-23 08:27:00     /go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/middleware.go:24
2025-10-23 08:27:00 github.com/contribsys/faktory_worker_go.(*Manager).dispatch
2025-10-23 08:27:00     /go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/manager.go:66
2025-10-23 08:27:00 github.com/contribsys/faktory_worker_go.processOne
2025-10-23 08:27:00     /go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/runner.go:159
2025-10-23 08:27:00 github.com/contribsys/faktory_worker_go.process
2025-10-23 08:27:00     /go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/runner.go:101
2025-10-23 08:27:00 2025/10/23 14:27:00.756614 Error running ModelStatusUpdateJob job i0VcBI12N-Ab2WQc: recovered from panic. Error: runtime error: invalid memory address or nil pointer dereference

Copy link
Collaborator

@StevenWadeOddball StevenWadeOddball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Peyton232 , I don't think this will address the current issue with translated audit logging an error, will it? the error is logged currently from the translated audit package. We pass a logger, but we are calling the error method directly there. Example from recent error message. I think this just adds additional logging, but doesn't conditionally silence logged errors

{"level":"error","ts":1760992321.4104438,"caller":"translatedaudit/translated_audit.go:66","msg":"issue saving translated audit.","app_section":"faktory","JID":"0Il7U53WmIU5wD-K","job_type":"TranslateAuditJob","traceID":"85ab5c78-d9cc-41af-aade-4e5ae5004226","change_id":12230,"translated_audit_queue_id":"20d79530-9201-4b61-99ad-52e3a1355f01","error":"issue creating new TranslatedAuditChange object: dbErr: message: issue executing named statement, error pq: insert or update on table \"translated_audit\" violates foreign key constraint \"translated_audit_model_plan_id_fkey\"","stacktrace":"[github.com/cms-enterprise/mint-app/pkg/translatedaudit.TranslateAudit](http://github.com/cms-enterprise/mint-app/pkg/translatedaudit.TranslateAudit)\n\t/mint/pkg/translatedaudit/translated_audit.go:66\[ngithub.com/cms-enterprise/mint-app/pkg/translatedaudit.TranslateAuditJobByID](http://ngithub.com/cms-enterprise/mint-app/pkg/translatedaudit.TranslateAuditJobByID)\n\t/mint/pkg/translatedaudit/translated_audit_job_utitilities.go:35\[ngithub.com/cms-enterprise/mint-app/pkg/worker.(*Worker).TranslateAuditJob](http://ngithub.com/cms-enterprise/mint-app/pkg/worker.(*Worker).TranslateAuditJob)\n\t/mint/pkg/worker/translate_audit_job.go:57\[ngithub.com/cms-enterprise/mint-app/pkg/worker.(*Worker).Work.JobWithPanicProtection.func1](http://ngithub.com/cms-enterprise/mint-app/pkg/worker.(*Worker).Work.JobWithPanicProtection.func1)\n\t/mint/pkg/worker/job_utilities.go:19\[ngithub.com/cms-enterprise/mint-app/pkg/worker.(*Worker).Work.(*Manager).Register.func2](http://ngithub.com/cms-enterprise/mint-app/pkg/worker.(*Worker).Work.(*Manager).Register.func2)\n\t/go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/manager.go:49\[ngithub.com/contribsys/faktory_worker_go.dispatch](http://ngithub.com/contribsys/faktory_worker_go.dispatch)\n\t/go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/middleware.go:19\[ngithub.com/contribsys/faktory_worker_go.(*Manager).dispatch](http://ngithub.com/contribsys/faktory_worker_go.(*Manager).dispatch)\n\t/go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/manager.go:66\[ngithub.com/contribsys/faktory_worker_go.processOne](http://ngithub.com/contribsys/faktory_worker_go.processOne)\n\t/go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/runner.go:159\[ngithub.com/contribsys/faktory_worker_go.process](http://ngithub.com/contribsys/faktory_worker_go.process)\n\t/go/pkg/mod/github.com/contribsys/faktory_worker_go@v1.7.0/runner.go:101"}

@Peyton232 Peyton232 marked this pull request as draft October 28, 2025 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants