Skip to content

Conversation

@ericallam
Copy link
Member

@ericallam ericallam commented Dec 25, 2025

Fix: Prevent unbounded memory growth from metrics cardinality explosion

Problem

After deploying the batch queue system, the engine worker service experienced event loop lag (5-8+ seconds) that
would build up over 5-6 hours of operation and resolve on restart. Investigation revealed that OpenTelemetry metrics
were being recorded with high-cardinality attributes (messageId, queueId), causing the metrics SDK to accumulate
unbounded aggregator state over time.

Changes

  • Remove high-cardinality attributes from all metric recording calls in FairQueue
  • Metrics are now recorded without dimensions to prevent cardinality explosion
  • Span/trace attributes remain unchanged for detailed per-message observability
  • Simplify getDynamicAttributes callback to only include cache sizes

Why this works

Each unique attribute combination in OpenTelemetry metrics creates a new time series with its own aggregator. With
messageId (unique per message) and queueId (unique per batch) as attributes, the SDK was accumulating millions of
aggregators over hours of operation, causing:

  • Slow memory growth
  • Network TX growing faster than memory (more data to export)
  • Event loop blocking during SDK operations (export, GC)

@changeset-bot
Copy link

changeset-bot bot commented Dec 25, 2025

⚠️ No Changeset found

Latest commit: 6d57509

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 25, 2025

Walkthrough

Telemetry instrumentation in the fair-queue module is refactored to remove messageAttributes objects from method calls. Affected telemetry calls include enqueue, enqueueBatch, recordQueueTime, recordComplete, recordProcessingTime, recordFailure, recordRetry, and recordDLQ operations. The corresponding JSDoc in the telemetry module is updated to clarify that messageAttributes creates standard attributes for spans and traces, noting that high cardinality attributes are acceptable in this context.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description provides thorough context but is missing the required checklist, testing steps, and changelog sections from the template. Complete the PR template by adding the required checklist with items marked, describe testing steps, and provide a concise changelog entry.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: preventing cardinality explosion in metrics reporting that causes event loop lag.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ea-branch-112

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c3dfac and 6d57509.

📒 Files selected for processing (2)
  • packages/redis-worker/src/fair-queue/index.ts
  • packages/redis-worker/src/fair-queue/telemetry.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • packages/redis-worker/src/fair-queue/telemetry.ts
  • packages/redis-worker/src/fair-queue/index.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • packages/redis-worker/src/fair-queue/telemetry.ts
  • packages/redis-worker/src/fair-queue/index.ts
**/*.{js,ts,jsx,tsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier

Files:

  • packages/redis-worker/src/fair-queue/telemetry.ts
  • packages/redis-worker/src/fair-queue/index.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (9)
packages/redis-worker/src/fair-queue/telemetry.ts (1)

327-328: Excellent clarification of intended usage.

The updated JSDoc clearly distinguishes that messageAttributes is for spans/traces where high cardinality is acceptable, not for metrics. This aligns with observability best practices and the changes in index.ts.

packages/redis-worker/src/fair-queue/index.ts (8)

330-330: LGTM! Metric cardinality fixed.

Removing the attributes parameter prevents cardinality explosion from unique message/queue IDs. Detailed information remains available in the span attributes set on lines 323-328.


428-428: LGTM! Batch metric simplified.

Recording only the count prevents cardinality issues while still tracking enqueue volume. Per-message details remain in span attributes (lines 421-426).


1378-1378: LGTM! Queue time metric properly scoped.

Recording only the duration value prevents per-message cardinality explosion while preserving the ability to aggregate queue time metrics across all messages.


1394-1395: LGTM! Completion metrics simplified.

Both completion count and processing time metrics now avoid high-cardinality attributes, enabling proper aggregation without memory pressure from excessive metric series.


1521-1521: LGTM! Failure metric properly scoped.

Removing attributes allows tracking overall failure rates without creating per-message metric series.


1552-1552: LGTM! Retry metric simplified.

Recording retries without per-message attributes enables proper aggregation of retry rates across the system.


1608-1608: LGTM! DLQ metric properly scoped.

Removing attributes from the dead letter queue metric prevents cardinality explosion while maintaining the ability to track DLQ volume.


330-1608: All telemetry metric method calls across the codebase follow the expected pattern and do not pass high-cardinality attributes. Verification confirms:

  • recordEnqueue(), recordComplete(), recordFailure(), recordRetry(), and recordDLQ() are all called with no parameters
  • recordQueueTime(), recordProcessingTime(), and recordEnqueueBatch() are called only with their required value/count parameters
  • No attributes are passed to any of these methods in the entire codebase

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericallam ericallam changed the title fix(fair-queue): prevent cardinality explosion when reporting metrics, creating large event loop lag fix(fair-queue): Prevent unbounded memory growth from metrics cardinality explosion Dec 25, 2025
@vibe-kanban-cloud
Copy link

Review Complete

Your review story is ready!

View Story

Comment !reviewfast on this PR to re-generate the story.

@ericallam ericallam merged commit edf5b14 into main Dec 25, 2025
32 checks passed
@ericallam ericallam deleted the ea-branch-112 branch December 25, 2025 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants