fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion #2819

ericallam · 2025-12-25T08:25:29Z

Fix: Prevent unbounded memory growth from metrics cardinality explosion

Problem

After deploying the batch queue system, the engine worker service experienced event loop lag (5-8+ seconds) that
would build up over 5-6 hours of operation and resolve on restart. Investigation revealed that OpenTelemetry metrics
were being recorded with high-cardinality attributes (messageId, queueId), causing the metrics SDK to accumulate
unbounded aggregator state over time.

Changes

Remove high-cardinality attributes from all metric recording calls in FairQueue
Metrics are now recorded without dimensions to prevent cardinality explosion
Span/trace attributes remain unchanged for detailed per-message observability
Simplify getDynamicAttributes callback to only include cache sizes

Why this works

Each unique attribute combination in OpenTelemetry metrics creates a new time series with its own aggregator. With
messageId (unique per message) and queueId (unique per batch) as attributes, the SDK was accumulating millions of
aggregators over hours of operation, causing:

Slow memory growth
Network TX growing faster than memory (more data to export)
Event loop blocking during SDK operations (export, GC)

…, creating large event loop lag

changeset-bot · 2025-12-25T08:25:33Z

⚠️ No Changeset found

Latest commit: 6d57509

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2025-12-25T08:25:45Z

Walkthrough

Telemetry instrumentation in the fair-queue module is refactored to remove messageAttributes objects from method calls. Affected telemetry calls include enqueue, enqueueBatch, recordQueueTime, recordComplete, recordProcessingTime, recordFailure, recordRetry, and recordDLQ operations. The corresponding JSDoc in the telemetry module is updated to clarify that messageAttributes creates standard attributes for spans and traces, noting that high cardinality attributes are acceptable in this context.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description provides thorough context but is missing the required checklist, testing steps, and changelog sections from the template.	Complete the PR template by adding the required checklist with items marked, describe testing steps, and provide a concise changelog entry.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: preventing cardinality explosion in metrics reporting that causes event loop lag.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch ea-branch-112

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c3dfac and 6d57509.

📒 Files selected for processing (2)

packages/redis-worker/src/fair-queue/index.ts
packages/redis-worker/src/fair-queue/telemetry.ts

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

packages/redis-worker/src/fair-queue/telemetry.ts
packages/redis-worker/src/fair-queue/index.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

packages/redis-worker/src/fair-queue/telemetry.ts
packages/redis-worker/src/fair-queue/index.ts

**/*.{js,ts,jsx,tsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier

Files:

packages/redis-worker/src/fair-queue/telemetry.ts
packages/redis-worker/src/fair-queue/index.ts

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)

GitHub Check: Cursor Bugbot
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: typecheck / typecheck
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (9)

packages/redis-worker/src/fair-queue/telemetry.ts (1)

327-328: Excellent clarification of intended usage.

The updated JSDoc clearly distinguishes that messageAttributes is for spans/traces where high cardinality is acceptable, not for metrics. This aligns with observability best practices and the changes in index.ts.

packages/redis-worker/src/fair-queue/index.ts (8)

330-330: LGTM! Metric cardinality fixed.

Removing the attributes parameter prevents cardinality explosion from unique message/queue IDs. Detailed information remains available in the span attributes set on lines 323-328.

428-428: LGTM! Batch metric simplified.

Recording only the count prevents cardinality issues while still tracking enqueue volume. Per-message details remain in span attributes (lines 421-426).

1378-1378: LGTM! Queue time metric properly scoped.

Recording only the duration value prevents per-message cardinality explosion while preserving the ability to aggregate queue time metrics across all messages.

1394-1395: LGTM! Completion metrics simplified.

Both completion count and processing time metrics now avoid high-cardinality attributes, enabling proper aggregation without memory pressure from excessive metric series.

1521-1521: LGTM! Failure metric properly scoped.

Removing attributes allows tracking overall failure rates without creating per-message metric series.

1552-1552: LGTM! Retry metric simplified.

Recording retries without per-message attributes enables proper aggregation of retry rates across the system.

1608-1608: LGTM! DLQ metric properly scoped.

Removing attributes from the dead letter queue metric prevents cardinality explosion while maintaining the ability to track DLQ volume.

330-1608: All telemetry metric method calls across the codebase follow the expected pattern and do not pass high-cardinality attributes. Verification confirms:

recordEnqueue(), recordComplete(), recordFailure(), recordRetry(), and recordDLQ() are all called with no parameters

recordQueueTime(), recordProcessingTime(), and recordEnqueueBatch() are called only with their required value/count parameters

No attributes are passed to any of these methods in the entire codebase

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

vibe-kanban-cloud · 2025-12-25T08:28:40Z

Review Complete

Your review story is ready!

View Story

Comment !reviewfast on this PR to re-generate the story.

fix(fair-queue): prevent cardinality explosion when reporting metrics…

6d57509

…, creating large event loop lag

mpcgrid approved these changes Dec 25, 2025

View reviewed changes

ericallam changed the title ~~fix(fair-queue): prevent cardinality explosion when reporting metrics, creating large event loop lag~~ fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion Dec 25, 2025

ericallam merged commit edf5b14 into main Dec 25, 2025
32 checks passed

ericallam deleted the ea-branch-112 branch December 25, 2025 08:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion #2819

fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion #2819

Uh oh!

ericallam commented Dec 25, 2025 •

edited

Loading

Uh oh!

changeset-bot bot commented Dec 25, 2025

Uh oh!

coderabbitai bot commented Dec 25, 2025 •

edited

Loading

Uh oh!

vibe-kanban-cloud bot commented Dec 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion #2819

fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion #2819

Uh oh!

Conversation

ericallam commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

changeset-bot bot commented Dec 25, 2025

⚠️ No Changeset found

Uh oh!

coderabbitai bot commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

vibe-kanban-cloud bot commented Dec 25, 2025

Review Complete

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ericallam commented Dec 25, 2025 •

edited

Loading

coderabbitai bot commented Dec 25, 2025 •

edited

Loading