fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion #2819
…, creating large event loop lag
Walkthrough
Telemetry instrumentation in the fair-queue module is refactored to remove messageAttributes objects from method calls. Affected telemetry calls include enqueue, enqueueBatch, recordQueueTime, recordComplete, recordProcessingTime, recordFailure, recordRetry, and recordDLQ operations. The corresponding JSDoc in the telemetry module is updated to clarify that messageAttributes creates standard attributes for spans and traces, noting that high cardinality attributes are acceptable in this context.
Estimated code review effort: 2 (Simple), ~12 minutes
Fix: Prevent unbounded memory growth from metrics cardinality explosion
Problem
After deploying the batch queue system, the engine worker service experienced event loop lag (5-8+ seconds) that
would build up over 5-6 hours of operation and resolve on restart. Investigation revealed that OpenTelemetry metrics
were being recorded with high-cardinality attributes (messageId, queueId), causing the metrics SDK to accumulate
unbounded aggregator state over time.
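The growth pattern described above can be illustrated with a toy model. This is a minimal sketch, not the OpenTelemetry SDK: `ToyCounter`, `simulate`, and the `operation` attribute are hypothetical names introduced only for illustration. It assumes a counter that keeps one aggregator entry per unique attribute combination, and compares per-message attributes against a single fixed attribute.

```typescript
// Toy model of a metrics counter: one aggregator entry per unique attribute set.
// This is NOT the OpenTelemetry SDK, only an illustration of the growth pattern.
type Attributes = Record<string, string>;

class ToyCounter {
  private readonly series = new Map<string, number>();

  add(value: number, attributes: Attributes = {}): void {
    // Serialize attributes in sorted key order so equal sets share one series.
    const key = JSON.stringify(
      Object.keys(attributes)
        .sort()
        .map((k) => [k, attributes[k]])
    );
    this.series.set(key, (this.series.get(key) ?? 0) + value);
  }

  // Number of distinct series (aggregators) currently held in memory.
  get seriesCount(): number {
    return this.series.size;
  }
}

function simulate(messages: number): { high: number; low: number } {
  const highCardinality = new ToyCounter();
  const lowCardinality = new ToyCounter();

  for (let i = 0; i < messages; i++) {
    // messageId is unique per message, so every call creates a new series.
    highCardinality.add(1, { messageId: `msg-${i}`, queueId: `queue-${i % 100}` });
    // A fixed, hypothetical attribute keeps the series count constant.
    lowCardinality.add(1, { operation: "enqueue" });
  }

  return { high: highCardinality.seriesCount, low: lowCardinality.seriesCount };
}

const result = simulate(10_000);
console.log(result);
```

In this model the high-cardinality counter holds one series per message while the low-cardinality counter holds a single series regardless of volume, which mirrors why state that is never evicted grows with the number of messages processed.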
Changes
Why this works
Each unique attribute combination in OpenTelemetry metrics creates a new time series with its own aggregator. With
messageId (unique per message) and queueId (unique per batch) as attributes, the SDK was accumulating millions of
aggregators over hours of operation, causing: