Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 4, 2025 8:26pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 4, 2025 8:26pm
rivet-inspector Ignored Ignored Preview Nov 4, 2025 8:26pm
rivet-site Ignored Ignored Preview Nov 4, 2025 8:26pm

Copy link
Member Author

NathanFlurry commented Nov 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 3, 2025

Code Review: Decrease Runner Lost Timeout to 15s

Summary

This PR reduces the RUNNER_LOST_THRESHOLD_MS from 2 minutes to 15 seconds, which determines how long the system waits after the last ping before forcibly removing a runner from the database and evicting all actors.

Positive Observations

Documentation Added: Good addition of the comment noting the ping interval is 3s
Constant Extraction: Properly extracted RUNNER_PING_INTERVAL as a named constant (replacing magic number 1000)
Code Cleanup: Removed unnecessary blank lines in TypeScript
Formatting Fix: Fixed the getActorHibernationConfig formatting to match code style

Critical Concerns

1. Aggressive Timeout Ratio (HIGH PRIORITY)

The new timeout is only 5x the ping interval (15s timeout / 3s ping). This is quite aggressive and could lead to false positives in production:

  • Network hiccups: A brief network issue or packet loss could cause 5 consecutive pings to fail
  • GC pauses: Long garbage collection pauses in the runner could delay pings
  • CPU contention: Under heavy load, ping tasks might be delayed
  • Reconnection time: Runners may not have sufficient time to reconnect before being evicted

Recommendation: Consider a more conservative ratio. Industry best practices typically use 3-5 missed heartbeats, but with more buffer. A timeout of 30-45 seconds (10-15x ping interval) would be more robust.

2. Impact Analysis Missing

The PR doesn't explain:

  • Why the timeout is being reduced from 2 minutes to 15 seconds
  • What problem this change solves
  • What testing was performed to validate this is safe
  • What the failure mode looks like when a runner is incorrectly marked as lost

Question: What motivated this change? Is there a specific issue with runners staying in the system too long?

3. Relationship with RUNNER_ELIGIBLE_THRESHOLD

The code has two thresholds:

  • RUNNER_ELIGIBLE_THRESHOLD_MS = 10s - When a runner is considered ineligible for allocation
  • RUNNER_LOST_THRESHOLD_MS = 15s - When a runner is forcibly evicted

The window between "ineligible" and "lost" is now only 5 seconds. This seems tight.

Question: Is 5 seconds enough time for the system to react to an ineligible runner before forcibly evicting it?

4. Potential Race Conditions

Looking at engine/packages/pegboard-runner/src/ping_task.rs:11, pings are sent via tokio::time::sleep(UPDATE_PING_INTERVAL). This doesn't account for:

  • Time spent processing the ping
  • Network latency
  • Clock drift between client and server

With a tight 15s timeout, these factors become more significant.

5. Actor Eviction Impact

Per the comment at engine/packages/pegboard/src/workflows/runner.rs:16-17:

"forcibly removing a runner from the database and deleting its workflow, evicting all actors"

This is a destructive operation. If actors are evicted due to a false positive, it could:

  • Disrupt active workflows
  • Lose in-memory state
  • Cause cascading failures in dependent systems

Recommendation: Consider adding metrics to track how often runners are marked as lost, and monitor after deploying this change.

6. Missing Test Coverage

I don't see any test changes validating:

  • Runners are correctly evicted after 15s of no pings
  • Runners are NOT evicted if they ping within 15s
  • Edge cases around network issues and reconnection

Recommendation: Add integration tests for the timeout behavior.

Code Quality

TypeScript Changes (engine/sdks/typescript/runner/src/mod.ts):

  • ✅ Good constant extraction
  • ✅ Clean code style improvements
  • ⚠️ Consider documenting why 3s was chosen for ping interval

Rust Changes (engine/packages/pegboard/src/workflows/runner.rs):

  • ✅ Comment documenting ping interval
  • ⚠️ Could benefit from a comment explaining why 15s is appropriate

Security Considerations

No direct security issues, but the aggressive timeout could be exploited:

  • Attackers could cause network disruption to trigger false evictions
  • Could be used as a DoS vector if runners are frequently evicted and need to restart

Performance Considerations

  • ✅ Faster detection of truly lost runners (good for resource cleanup)
  • ⚠️ Potential increase in false positives leading to unnecessary actor restarts
  • ⚠️ Could increase system load if runners are frequently evicted and reconnected

Recommendations

  1. Add context: Update the PR description explaining why this change is needed
  2. Consider a less aggressive timeout: 30-45 seconds would be safer while still being responsive
  3. Add monitoring: Implement metrics for runner eviction rates
  4. Add tests: Cover timeout scenarios and edge cases
  5. Gradual rollout: Consider deploying this change gradually with monitoring
  6. Document the decision: Add comments explaining the timeout value choice and tradeoffs

Questions for the Author

  1. What specific problem is this change solving?
  2. Have you tested this in a staging environment with real network conditions?
  3. What is the expected impact on actor availability?
  4. Have you measured current runner reconnection times?
  5. Are there metrics showing how often runners currently take >15s to reconnect?

Verdict

⚠️ Changes Requested - The code changes are clean, but the timeout reduction seems too aggressive without more context and testing. Please provide justification for the 15s timeout value and consider a more conservative approach.


Review generated following CLAUDE.md conventions

@NathanFlurry NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from a0d58a1 to 9fbbebe Compare November 4, 2025 00:36
@NathanFlurry NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from 7895f28 to 546e885 Compare November 4, 2025 00:36
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Decrease runner timeout to 15s

Summary

This PR reduces the runner lost threshold from 2 minutes to 15 seconds and introduces a constant for the ping interval (3s) in the TypeScript SDK. The change aims to detect lost runners more quickly.

Code Quality ✅

Good:

  • Added helpful comment documenting the ping interval (3s) in the Rust code
  • Extracted hardcoded ping interval into a named constant (RUNNER_PING_INTERVAL) in TypeScript
  • Formatting improvements (removing extra blank line, reformatting function signature)

Minor Issues:

  • The TypeScript changes include unrelated formatting fixes that could have been in a separate commit for clarity

Potential Issues ⚠️

Critical: Timeout Safety Margin
The new 15s threshold provides only a 5x multiplier over the 3s ping interval. This is concerning:

// Rust side
const RUNNER_LOST_THRESHOLD_MS: i64 = util::duration::seconds(15);

// TypeScript side  
const RUNNER_PING_INTERVAL = 3_000; // 3s

Concerns:

  1. Network variability: In production, network delays, packet loss, or brief connectivity issues could easily cause 2-3 missed pings
  2. System load: Under high load, ping timing may become unreliable
  3. False positives: A temporarily slow network could trigger unnecessary runner evictions and actor rescheduling

Recommendations:

  • Consider a threshold of 30-45s (10-15x the ping interval) for better resilience
  • If 15s is intentional for testing/development, document the reasoning
  • Add monitoring for false-positive evictions

Performance Considerations 🔄

Positive:

  • Faster detection of genuinely lost runners reduces wasted resources
  • Actors can be rescheduled more quickly on healthy runners

Negative:

  • Increased risk of cascading failures during network hiccups
  • More frequent actor rescheduling could increase system load

Security Concerns ✅

No security issues identified. The change is purely operational.

Test Coverage ❓

Questions:

  • Were there tests validating the 2-minute threshold behavior?
  • Are there integration tests covering runner reconnection scenarios?
  • Has this been tested under network simulation (latency, packet loss)?

Suggestions:

  • Add tests for the edge case: runner disconnects at 14.9s, reconnects immediately
  • Test behavior during rolling network issues (intermittent connectivity)

Additional Observations

Code Consistency:
The Rust code uses a public constant RUNNER_ELIGIBLE_THRESHOLD_MS (10s) and private RUNNER_LOST_THRESHOLD_MS (15s). This 5-second gap seems intentional for hysteresis but should be documented.

Line 91 in runner.rs:

match ctx
    .listen_with_timeout::<Main>(RUNNER_LOST_THRESHOLD_MS)
    .await?

This timeout is used correctly in the workflow loop.

Recommendations

  1. Document the rationale for choosing 15s in the PR description or code comments
  2. Consider increasing to 30-45s for production safety unless there's a specific need for aggressive timeouts
  3. Add metrics to track false-positive evictions vs. genuine failures
  4. Coordinate with SRE/DevOps to ensure this change aligns with infrastructure reliability
  5. Consider making this configurable per environment (dev vs. prod)

Verdict

The code changes are clean and well-structured, but the 15-second timeout seems risky for production use. I'd recommend either:

  • Increasing to 30-45s for better fault tolerance, or
  • Providing clear justification for the aggressive timeout

Otherwise, the implementation looks good! ✨

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Decrease Runner Timeout to 15s

I've reviewed this PR and have several observations and recommendations:

Summary

This PR reduces the runner lost threshold from 2 minutes to 15 seconds, while the ping interval remains at 3 seconds. This change makes the system more responsive to lost runners but introduces some risks.


Code Quality & Best Practices

✅ Positive aspects:

  1. Good documentation - The added comment explaining the ping interval (3s) helps clarify the relationship between ping frequency and timeout
  2. Constant extraction - Moving the hardcoded 1000 to RUNNER_PING_INTERVAL constant improves maintainability
  3. Code cleanup - Removing extraneous blank lines and improving formatting

⚠️ Areas for improvement:

1. Magic number ratio

The ratio between RUNNER_LOST_THRESHOLD_MS (15s) and RUNNER_PING_INTERVAL (3s) is 5:1. This means only 5 missed pings will cause a runner to be considered lost. Consider:

  • Making this ratio explicit and configurable
  • Adding a constant like RUNNER_LOST_PING_MULTIPLIER = 5
  • Or even better: RUNNER_LOST_THRESHOLD_MS = RUNNER_PING_INTERVAL * 5

This would make the relationship clearer and easier to adjust.

2. Cross-language coordination

The TypeScript SDK (mod.ts) defines RUNNER_PING_INTERVAL but the Rust workflow (runner.rs) hardcodes the relationship. If someone changes the TypeScript ping interval without updating the Rust threshold, the system could break. Consider:

  • Documenting this dependency more prominently in both files
  • Using a shared configuration mechanism if possible
  • Adding a build-time check or test to verify they stay in sync

Potential Bugs & Issues

1. Network instability tolerance ⚠️

With a 15-second timeout and 3-second pings, the system has very little tolerance for network hiccups:

  • 5 missed pings = lost runner
  • Network jitter, brief connectivity issues, or CPU contention could easily cause false positives
  • Previously at 2 minutes, there was 40x ping buffer

Recommendation: Monitor production metrics closely after deploying this change. Consider:

  • Adding metrics to track how often runners are marked as lost
  • Implementing exponential backoff or jitter in the ping mechanism
  • Having a grace period or warning state before marking a runner as lost

2. Race conditions during network recovery

If a runner reconnects after being marked lost (but before actors are rescheduled), there's potential for:

  • Duplicate actor allocations
  • State inconsistencies
  • Actor thrashing (rapid stop/start cycles)

The code at runner.rs:246-283 handles draining runners, but with a tighter timeout, edge cases are more likely.


Performance Considerations

✅ Positive impact:

  • Faster detection of genuinely lost runners
  • Quicker actor rescheduling
  • Better system responsiveness

⚠️ Potential concerns:

  1. Increased actor churn - More aggressive timeouts could lead to more frequent actor migrations
  2. Database load - More frequent runner state changes means more database writes
  3. Workflow overhead - More timeout events being processed in the workflow loop

Recommendation: Add observability metrics:

// Track how often timeouts actually fire
tracing::info!(runner_id=?input.runner_id, "runner lost threshold reached");

Security Concerns

No significant security issues identified. The change doesn't introduce new attack vectors.


Test Coverage

❌ Missing test coverage:

  • No tests found for runner timeout behavior
  • No integration tests for the ping/lost detection mechanism
  • No tests verifying the synchronization between TypeScript and Rust timeout values

Recommendation: Add tests for:

  1. Runner marked as lost after missing N pings
  2. Runner successfully reconnects before timeout
  3. Actors properly rescheduled when runner lost
  4. Edge case: runner reconnects during draining state

Example test structure:

#[tokio::test]
async fn test_runner_lost_after_timeout() {
    // Setup runner workflow
    // Stop sending pings
    // Wait RUNNER_LOST_THRESHOLD_MS + buffer
    // Assert runner marked as lost
    // Assert actors rescheduled
}

Additional Observations

1. Related code in mod.ts:255

The removed blank line is cosmetic but the reformatting of getActorHibernationConfig parameters looks like an unrelated formatting change. While not problematic, it would be cleaner to keep formatting changes separate from functional changes.

2. Consider making timeout configurable

Rather than hardcoding, consider making this a configuration parameter that can be adjusted per environment:

  • Development: longer timeout for debugging
  • Production: shorter timeout for responsiveness
  • Could be part of the RunnerConfig in TypeScript

Recommendations Summary

Before merging:

  1. ✅ Add explicit documentation about the 5:1 ratio between timeout and ping interval
  2. ✅ Add a constant expressing the multiplier relationship
  3. ⚠️ Consider adding basic test coverage for timeout behavior
  4. ⚠️ Add observability/metrics to track runner lost events

After merging:

  1. 📊 Monitor runner lost frequency in production
  2. 📊 Track actor rescheduling rates
  3. 📊 Watch for increased database write load
  4. 🔍 Be prepared to adjust the timeout if false positives occur

Conclusion

This is a reasonable change that makes the system more responsive, but it comes with tradeoffs. The decreased timeout window means less tolerance for transient network issues. The code quality is good, but test coverage is lacking. I recommend:

  1. Merge with caution - This is deployable but needs monitoring
  2. Add metrics - Track the actual impact in production
  3. Follow up - Add tests and consider making timeouts configurable
  4. Document - Add comments explaining the timeout/ping relationship

Overall assessment: ✅ Approve with recommendations for follow-up work.


Review generated with assistance from Claude Code

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Decrease runner timeout to 15s

Summary

This PR reduces the RUNNER_LOST_THRESHOLD_MS from 2 minutes to 15 seconds and extracts the ping interval (3s) as a constant in the TypeScript SDK. The changes aim to make runner failure detection more responsive.

Code Quality & Best Practices

Positive:

  • ✅ Good addition of documentation comment explaining the relationship between ping interval (3s) and timeout (15s)
  • ✅ Proper constant extraction (RUNNER_PING_INTERVAL) improves code maintainability
  • ✅ Whitespace cleanup is appropriate

Concerns:

  • ⚠️ Inconsistent constant definition: The ping interval is defined in two places:

    • TypeScript: RUNNER_PING_INTERVAL = 3_000 (line 12)
    • Rust comment: "Runner ping interval is currently set to 3s" (line 19)

    Consider extracting this to a shared protocol constant to maintain single source of truth.

Potential Issues

1. Timeout Safety Margin (Critical)

The current configuration has a 5x safety margin (15s timeout / 3s ping = 5 pings). However:

  • Network latency, processing delays, or temporary infrastructure issues could cause missed pings
  • With only 5 ping opportunities before timeout, transient network issues could cause unnecessary runner evictions
  • Recommendation: Consider a 30-45s timeout for a safer 10-15x margin, or document why 15s is sufficient for your infrastructure

2. Race Condition Risk

Reducing timeout from 2 minutes to 15 seconds significantly increases the risk of race conditions between:

  • Runner attempting to reconnect
  • Server marking runner as lost and evicting actors
  • Actor workflows receiving "Lost" signals

The code at runner.rs:324-334 handles timeout expiry, but there's no apparent grace period for reconnection attempts.

3. Impact on Actor Rescheduling

When a runner times out, all actors are marked as lost (runner.rs:357-382). With a 15s timeout:

  • More actors will be rescheduled unnecessarily due to transient issues
  • This could cause cascading effects if many runners timeout simultaneously
  • Consider the impact on your actor crash policies and rescheduling logic

Performance Considerations

  • Positive: Faster failure detection improves responsiveness
  • Concern: More aggressive timeout may increase control plane load due to:
    • More frequent runner evictions and re-registrations
    • More actor rescheduling operations
    • More database transactions for cleanup operations

Security Concerns

No security issues identified.

Test Coverage

Missing:

  • ❌ No tests verify the timeout behavior with the new value
  • ❌ No tests validate the relationship between ping interval and timeout threshold
  • ❌ No load tests to verify system behavior under transient network issues

Recommendations:

  1. Add integration test that simulates runner missing 4 pings (should stay alive) vs 5+ pings (should timeout)
  2. Add test verifying reconnection behavior within the timeout window
  3. Consider chaos testing with artificial network delays

Suggested Improvements

// In runner.rs, consider making these configurable or better documented:
/// Runner ping interval in milliseconds. This is set in the TypeScript SDK.
const RUNNER_PING_INTERVAL_MS: i64 = util::duration::seconds(3);

/// How long to wait after last ping before forcibly removing a runner.
/// Set to 5x the ping interval to allow for some missed pings due to transient issues.
const RUNNER_LOST_THRESHOLD_MS: i64 = RUNNER_PING_INTERVAL_MS * 5;

Questions for the Author

  1. What operational experience led to reducing this from 2 minutes to 15 seconds?
  2. Have you tested this under production-like network conditions with packet loss/delays?
  3. Is there monitoring in place to track false-positive runner evictions?
  4. Are there any automatic reconnection mechanisms that would benefit from a longer timeout?

Verdict

⚠️ Request Changes: While the code changes are clean, the 15-second timeout seems aggressive without additional safeguards. Consider:

  1. Increasing to 30-45 seconds, OR
  2. Adding comprehensive tests demonstrating stability, OR
  3. Implementing a grace period/reconnection mechanism, OR
  4. Providing metrics from testing that validate 15s is safe for your infrastructure

The current 5x safety margin is at the lower bound of what's typically recommended for distributed systems with network communication.


Review generated with assistance from Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from 0423a08 to dcca850 Compare November 4, 2025 19:24
@NathanFlurry NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from d9e0eb8 to d2735a8 Compare November 4, 2025 19:24
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Decrease runner timeout to 15s

Summary

This PR reduces the RUNNER_LOST_THRESHOLD_MS from 2 minutes to 15 seconds, with a corresponding addition of documentation noting the 3-second ping interval. The TypeScript runner SDK also refactors the ping interval to use a named constant.


Code Quality & Best Practices

Positive:

  • ✅ Good documentation added explaining the relationship between ping interval (3s) and timeout (15s)
  • ✅ Magic number elimination in TypeScript by introducing RUNNER_PING_INTERVAL constant
  • ✅ Consistent timeout-to-ping ratio (15s timeout / 3s ping = 5 missed pings before timeout)

Minor formatting improvements:

  • The TypeScript changes include some formatting updates (function signature alignment) that are reasonable but could have been in a separate commit for clearer history

Potential Issues & Concerns

🔴 Critical: Aggressive timeout may cause false positives

The 15-second timeout is very aggressive and could lead to false positive runner evictions in the following scenarios:

  1. Network hiccups: Brief network interruptions (>15s) will now cause runner eviction
  2. System load: If the runner process is temporarily CPU-starved or experiences GC pauses, it may miss multiple pings
  3. WebSocket delays: WS connection issues that previously would have self-recovered within 2 minutes will now trigger eviction

Risk analysis:

  • Previous timeout: 2 minutes (40 missed pings)
  • New timeout: 15 seconds (5 missed pings)
  • This is a 92.5% reduction in tolerance

Recommendation: Consider a middle ground like 30-45 seconds (10-15 missed pings) to balance responsiveness with reliability.


⚠️ Moderate: Missing context on why this change is needed

The PR description is empty, making it difficult to understand:

  • What problem does this solve?
  • Has this been tested in production-like conditions?
  • Were there specific incidents that motivated this change?
  • What's the expected impact on runner stability?

⚠️ Moderate: Consider adding metrics/alerting

With such an aggressive timeout, you should consider:

  1. Adding metrics to track runner eviction reasons (timeout vs. graceful shutdown)
  2. Alerting on abnormally high eviction rates
  3. Logging when runners are evicted due to timeout vs. other reasons

From the code at runner.rs:324-333, the timeout handler doesn't distinguish between network issues and actual runner failures.


Performance Considerations

Ping frequency impact:

  • Ping interval: 3 seconds
  • Ping overhead: Minimal (protocol message is small)
  • Network impact: Acceptable for most deployments

Database queries on timeout:

  • The check_expired activity (lines 939-964) runs a DB query when timeout occurs
  • With 15s timeout, this could increase DB load if many runners have connection issues
  • Consider: Does RUNNER_ELIGIBLE_THRESHOLD_MS (10s) need adjustment too? It's close to the lost threshold now

Security Considerations

No significant security concerns. The change makes the system more aggressive in evicting potentially-dead runners, which could be viewed as improving resource management but also as increasing the attack surface for DoS via connection disruption.


Test Coverage

Missing:

  • No test changes included
  • Consider adding tests for:
    • Runner eviction at exactly 15s timeout
    • Actor rescheduling behavior when runner times out
    • Recovery behavior if runner reconnects after eviction

Existing coverage:
The workflow at runner.rs:91 uses listen_with_timeout::<Main>(RUNNER_LOST_THRESHOLD_MS) which should respect the new timeout, but edge cases should be tested.


Additional Observations

Rust side (runner.rs:20):

/// Runner ping interval is currently set to 3s.
const RUNNER_LOST_THRESHOLD_MS: i64 = util::duration::seconds(15);

TypeScript side (mod.ts:532):

}, RUNNER_PING_INTERVAL);

✅ The constant is properly synchronized between both sides.

Consider:

  1. Should RUNNER_PING_INTERVAL be configurable via RunnerConfig?
  2. Should the timeout be configurable per deployment (dev vs. prod)?
  3. Document the timeout in the protocol metadata (it's already sent at line 119: runner_lost_threshold: RUNNER_LOST_THRESHOLD_MS)

Recommendations

Before merging:

  1. 🔴 Test in staging environment with representative network conditions
  2. 🔴 Add PR description explaining the motivation and expected impact
  3. ⚠️ Consider increasing timeout to 30-45s for safety
  4. ⚠️ Add metrics/logging to track eviction rates
  5. ⚠️ Document rollback plan if this causes issues in production

Future improvements:

  • Consider adaptive timeouts based on historical ping latency
  • Add health checks beyond just ping/pong
  • Consider exponential backoff for reconnection attempts

Verdict

⚠️ Approve with concerns - The code changes are clean and well-structured, but the 15-second timeout is very aggressive and needs careful production validation. Please address the recommendations above, particularly around testing and monitoring.

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Decrease Runner Timeout to 15s

Overview

This PR reduces the runner lost threshold from 2 minutes to 15 seconds, aligning it more closely with the 3-second ping interval. The changes span both the Rust backend (Pegboard) and TypeScript SDK.


Code Quality & Best Practices

Positive Observations

  1. Good Documentation: The addition of the comment in runner.rs:19 explaining that "Runner ping interval is currently set to 3s" provides helpful context for why 15s is an appropriate threshold.

  2. Consistency: The constant RUNNER_PING_INTERVAL was properly extracted in the TypeScript code (line 12), replacing the hardcoded 1000 value. This improves maintainability.

  3. Clean Formatting Fix: Removed the unnecessary blank line at mod.ts:258 and properly formatted the getActorHibernationConfig function signature.

  4. Follows Conventions: The commit message follows the conventional commit format specified in CLAUDE.md: chore(runner): decrease timeout to 15s.


Potential Issues & Concerns

⚠️ Critical: Timeout Calculation Concern

The new 15-second threshold creates a tight margin:

  • Ping interval: 3 seconds
  • Lost threshold: 15 seconds
  • Theoretical pings before timeout: 5 pings

Concerns:

  1. Network Reliability: In production environments with network jitter, packet loss, or temporary connectivity issues, 5 missed pings (15s) might be too aggressive. This could lead to false positives where healthy runners are marked as lost.

  2. No Buffer for System Load: If the runner process is under heavy load (GC pause, CPU contention), it might miss several ping intervals, causing premature eviction.

  3. Consider the Math:

    • Old threshold: 120s / 3s = ~40 pings of tolerance
    • New threshold: 15s / 3s = 5 pings of tolerance

    This is a 8x reduction in tolerance, which is quite aggressive.

Recommendation: Consider increasing to 30-45 seconds (10-15 pings) to balance responsiveness with stability. Alternatively, add telemetry to monitor false positives before deploying this change.

⚠️ Missing Test Coverage

The PR doesn't include any tests for the new timeout behavior. Consider adding:

  1. Tests to verify the 15-second threshold is correctly enforced
  2. Tests to ensure runners aren't prematurely evicted
  3. Integration tests for the ping mechanism with the new timing

According to CLAUDE.md's testing guidelines, you should pipe test output to /tmp/ and grep for verification.

📝 Documentation Considerations

  1. Constant Synchronization Risk: The ping interval (3s) is defined in TypeScript, but the timeout (15s) is in Rust. If someone changes the TypeScript ping interval without updating the Rust constant, the system could break. Consider:

    • Adding a comment in both files cross-referencing each other
    • Better yet, having the server communicate the timeout to the client during initialization (I see this is already done via ProtocolMetadata at line 119, which is good!)
  2. Missing Rationale: The PR body is empty. It would be helpful to understand:

    • Why was this change needed?
    • What problem does it solve?
    • Were there production issues with the 2-minute timeout?

Performance Considerations

Positive Impact:

  • Faster detection of lost runners means actors can be rescheduled more quickly, improving system resilience.
  • Less resource waste from zombie runners.

⚠️ Potential Negative Impact:

  • More aggressive timeouts could increase churn and cause unnecessary actor migrations if the threshold is too tight.

Security Concerns

✅ No security issues identified. The change is purely operational timing.


Architecture Alignment

Looking at the code at runner.rs:119, I can see that RUNNER_LOST_THRESHOLD_MS is communicated to the client via ProtocolMetadata. This is good architecture - the client can adapt to server-side timeout changes. However, the TypeScript client should ideally use this value to adjust its ping interval dynamically rather than having it hardcoded.


Recommendations

  1. Consider a more conservative timeout (30-45s) or add metrics/monitoring before rolling out to production
  2. Add test coverage for the new timeout behavior
  3. Add telemetry to track how often the timeout is triggered to validate the new threshold
  4. Document the rationale in the PR description
  5. Consider making the ping interval and timeout configurable rather than hardcoded constants

Summary

The code changes are clean and well-structured, but the 15-second timeout appears aggressive for production use. The 8x reduction in tolerance (from 120s to 15s) significantly increases the risk of false positives. I recommend gathering data on actual ping latencies in production before merging, or starting with a more conservative value like 30-45 seconds.

Verdict: ⚠️ Request Changes - Consider a less aggressive timeout or add supporting data/metrics to validate this threshold.


Review generated with context from CLAUDE.md and codebase conventions.

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
  • Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
  • Nov 4, 8:52 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:53 PM UTC: CI is running for this pull request on a draft pull request (#3353) due to your merge queue CI optimization settings.
  • Nov 4, 8:56 PM UTC: Merged by the Graphite merge queue via draft PR: #3353.

graphite-app bot pushed a commit that referenced this pull request Nov 4, 2025
@NathanFlurry NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from dcca850 to e4d3b77 Compare November 4, 2025 20:25
@NathanFlurry NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from d2735a8 to a114914 Compare November 4, 2025 20:25
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: chore(runner): decrease timeout to 15s

Summary

This PR reduces the runner lost threshold from 2 minutes to 15 seconds and consolidates the ping interval constant to a single named constant (RUNNER_PING_INTERVAL). The change makes the system more responsive to runner failures.

Code Quality & Best Practices

Good:

  • Added helpful documentation comment explaining the relationship between ping interval (3s) and timeout (15s)
  • Consolidated the ping interval into a named constant (RUNNER_PING_INTERVAL = 3_000), improving maintainability
  • Follows existing patterns in the codebase for timeout constants
  • Removed unnecessary blank lines (formatting cleanup)

⚠️ Considerations:

  1. Timeout/Ping Ratio: The new 15s timeout with a 3s ping interval gives you a 5x multiplier (15s / 3s = 5 pings). This is quite aggressive compared to the previous 40x multiplier (120s / 3s = 40 pings).

    • Risk: If there's any network hiccup causing 5 consecutive ping failures, runners will be marked as lost
    • Recommendation: Consider if 5x is sufficient buffer for network reliability in production. Common practice is 3-5x for fast fail-over or 10-15x for stability. You might want to track metrics on false positives after deployment.
  2. TypeScript Formatting Change: The PR includes a formatting change to getActorHibernationConfig parameters that appears unrelated to the timeout change. While this is fine, it's worth noting this was likely from an auto-formatter run.

Performance Considerations

Positive Impact:

  • Faster detection of lost runners (15s vs 2m) means:
    • Actors can be rescheduled more quickly
    • Reduced resource waste on "zombie" runners
    • Better user experience with faster recovery

⚠️ Potential Impact:

  • More frequent runner churn if network is unstable
  • Increased database operations from more frequent runner state transitions
  • More actor rescheduling operations if false positives occur

Recommendation: Monitor these metrics post-deployment:

  • Runner expiration rate
  • Actor rescheduling frequency
  • Average runner lifetime
  • False positive runner expirations (runners that reconnect after being marked lost)

Security Concerns

✅ No security concerns identified. The change is purely operational/timing related.

Test Coverage

⚠️ Missing:

  • No tests included in this PR
  • Consider adding a test case that verifies timeout behavior:
    • Runner marked as lost after RUNNER_LOST_THRESHOLD_MS without ping
    • Runner NOT marked as lost if pings arrive within threshold
    • Integration test for the 15s timeout in the workflow loop (line 91: listen_with_timeout::<Main>(RUNNER_LOST_THRESHOLD_MS))

Additional Notes

  1. Constant Naming Consistency: The Rust constant is RUNNER_LOST_THRESHOLD_MS while the TypeScript constant is RUNNER_PING_INTERVAL. Consider if RUNNER_PING_INTERVAL_MS would be more consistent (though the _MS suffix is less necessary for the interval constant).

  2. Protocol Metadata: The timeout value is correctly communicated to clients via protocol::ProtocolMetadata (line 119), so clients will be aware of the new threshold.

  3. Related Constants: The RUNNER_ELIGIBLE_THRESHOLD_MS = 10s (line 15) remains unchanged. This means:

    • After 10s without ping: ineligible for new allocations
    • After 15s without ping: marked as lost and removed
    • This 5s gap seems reasonable for graceful degradation.

Overall Assessment

Verdict: ✅ Approve with monitoring recommendation

The code changes are clean and well-implemented. The aggressive timeout reduction is a significant operational change that should be monitored closely in production. The 5x safety margin is on the lower end but may be appropriate for your use case if low-latency recovery is prioritized over stability.

Action Items:

  1. ✅ Code looks good - approve
  2. ⚠️ Add monitoring/alerts for runner expiration rates post-deployment
  3. 💡 Consider adding integration tests for timeout behavior
  4. 💡 Track false positive metrics to validate 15s is the right threshold

🤖 Generated with Claude Code

graphite-app bot pushed a commit that referenced this pull request Nov 4, 2025
@graphite-app graphite-app bot closed this Nov 4, 2025
@graphite-app graphite-app bot deleted the 11-03-chore_runner_decrease_timeout_to_15s branch November 4, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants