chore(runner): decrease timeout to 15s #3328

NathanFlurry · 2025-11-03T22:20:44Z

No description provided.

vercel · 2025-11-03T22:20:50Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Ready	Preview	Comment	Nov 4, 2025 8:26pm

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 4, 2025 8:26pm
rivet-inspector	Ignored	Preview	Nov 4, 2025 8:26pm
rivet-site	Ignored	Preview	Nov 4, 2025 8:26pm

NathanFlurry · 2025-11-03T22:21:04Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-03T22:26:53Z

Code Review: Decrease Runner Lost Timeout to 15s

Summary

This PR reduces the RUNNER_LOST_THRESHOLD_MS from 2 minutes to 15 seconds, which determines how long the system waits after the last ping before forcibly removing a runner from the database and evicting all actors.

Positive Observations

✅ Documentation Added: Good addition of the comment noting the ping interval is 3s
✅ Constant Extraction: Properly extracted RUNNER_PING_INTERVAL as a named constant (replacing magic number 1000)
✅ Code Cleanup: Removed unnecessary blank lines in TypeScript
✅ Formatting Fix: Fixed the getActorHibernationConfig formatting to match code style

Critical Concerns

1. Aggressive Timeout Ratio (HIGH PRIORITY)

The new timeout is only 5x the ping interval (15s timeout / 3s ping). This is quite aggressive and could lead to false positives in production:

Network hiccups: A brief network issue or packet loss could cause 5 consecutive pings to fail
GC pauses: Long garbage collection pauses in the runner could delay pings
CPU contention: Under heavy load, ping tasks might be delayed
Reconnection time: Runners may not have sufficient time to reconnect before being evicted

Recommendation: Consider a more conservative ratio. Industry best practices typically use 3-5 missed heartbeats, but with more buffer. A timeout of 30-45 seconds (10-15x ping interval) would be more robust.

2. Impact Analysis Missing

The PR doesn't explain:

Why the timeout is being reduced from 2 minutes to 15 seconds
What problem this change solves
What testing was performed to validate this is safe
What the failure mode looks like when a runner is incorrectly marked as lost

Question: What motivated this change? Is there a specific issue with runners staying in the system too long?

3. Relationship with RUNNER_ELIGIBLE_THRESHOLD

The code has two thresholds:

RUNNER_ELIGIBLE_THRESHOLD_MS = 10s - When a runner is considered ineligible for allocation
RUNNER_LOST_THRESHOLD_MS = 15s - When a runner is forcibly evicted

The window between "ineligible" and "lost" is now only 5 seconds. This seems tight.

Question: Is 5 seconds enough time for the system to react to an ineligible runner before forcibly evicting it?

4. Potential Race Conditions

Looking at engine/packages/pegboard-runner/src/ping_task.rs:11, pings are sent via tokio::time::sleep(UPDATE_PING_INTERVAL). This doesn't account for:

Time spent processing the ping
Network latency
Clock drift between client and server

With a tight 15s timeout, these factors become more significant.

5. Actor Eviction Impact

Per the comment at engine/packages/pegboard/src/workflows/runner.rs:16-17:

"forcibly removing a runner from the database and deleting its workflow, evicting all actors"

This is a destructive operation. If actors are evicted due to a false positive, it could:

Disrupt active workflows
Lose in-memory state
Cause cascading failures in dependent systems

Recommendation: Consider adding metrics to track how often runners are marked as lost, and monitor after deploying this change.

6. Missing Test Coverage

I don't see any test changes validating:

Runners are correctly evicted after 15s of no pings
Runners are NOT evicted if they ping within 15s
Edge cases around network issues and reconnection

Recommendation: Add integration tests for the timeout behavior.

Code Quality

TypeScript Changes (engine/sdks/typescript/runner/src/mod.ts):

✅ Good constant extraction
✅ Clean code style improvements
⚠️ Consider documenting why 3s was chosen for ping interval

Rust Changes (engine/packages/pegboard/src/workflows/runner.rs):

✅ Comment documenting ping interval
⚠️ Could benefit from a comment explaining why 15s is appropriate

Security Considerations

No direct security issues, but the aggressive timeout could be exploited:

Attackers could cause network disruption to trigger false evictions
Could be used as a DoS vector if runners are frequently evicted and need to restart

Performance Considerations

✅ Faster detection of truly lost runners (good for resource cleanup)
⚠️ Potential increase in false positives leading to unnecessary actor restarts
⚠️ Could increase system load if runners are frequently evicted and reconnected

Recommendations

Add context: Update the PR description explaining why this change is needed
Consider a less aggressive timeout: 30-45 seconds would be safer while still being responsive
Add monitoring: Implement metrics for runner eviction rates
Add tests: Cover timeout scenarios and edge cases
Gradual rollout: Consider deploying this change gradually with monitoring
Document the decision: Add comments explaining the timeout value choice and tradeoffs

Questions for the Author

What specific problem is this change solving?
Have you tested this in a staging environment with real network conditions?
What is the expected impact on actor availability?
Have you measured current runner reconnection times?
Are there metrics showing how often runners currently take >15s to reconnect?

Verdict

⚠️ Changes Requested - The code changes are clean, but the timeout reduction seems too aggressive without more context and testing. Please provide justification for the 15s timeout value and consider a more conservative approach.

Review generated following CLAUDE.md conventions

claude · 2025-11-04T00:41:28Z

PR Review: Decrease runner timeout to 15s

Summary

This PR reduces the runner lost threshold from 2 minutes to 15 seconds and introduces a constant for the ping interval (3s) in the TypeScript SDK. The change aims to detect lost runners more quickly.

Code Quality ✅

Good:

Added helpful comment documenting the ping interval (3s) in the Rust code
Extracted hardcoded ping interval into a named constant (RUNNER_PING_INTERVAL) in TypeScript
Formatting improvements (removing extra blank line, reformatting function signature)

Minor Issues:

The TypeScript changes include unrelated formatting fixes that could have been in a separate commit for clarity

Potential Issues ⚠️

Critical: Timeout Safety Margin
The new 15s threshold provides only a 5x multiplier over the 3s ping interval. This is concerning:

// Rust side
const RUNNER_LOST_THRESHOLD_MS: i64 = util::duration::seconds(15);

// TypeScript side  
const RUNNER_PING_INTERVAL = 3_000; // 3s

Concerns:

Network variability: In production, network delays, packet loss, or brief connectivity issues could easily cause 2-3 missed pings
System load: Under high load, ping timing may become unreliable
False positives: A temporarily slow network could trigger unnecessary runner evictions and actor rescheduling

Recommendations:

Consider a threshold of 30-45s (10-15x the ping interval) for better resilience
If 15s is intentional for testing/development, document the reasoning
Add monitoring for false-positive evictions

Performance Considerations 🔄

Positive:

Faster detection of genuinely lost runners reduces wasted resources
Actors can be rescheduled more quickly on healthy runners

Negative:

Increased risk of cascading failures during network hiccups
More frequent actor rescheduling could increase system load

Security Concerns ✅

No security issues identified. The change is purely operational.

Test Coverage ❓

Questions:

Were there tests validating the 2-minute threshold behavior?
Are there integration tests covering runner reconnection scenarios?
Has this been tested under network simulation (latency, packet loss)?

Suggestions:

Add tests for the edge case: runner disconnects at 14.9s, reconnects immediately
Test behavior during rolling network issues (intermittent connectivity)

Additional Observations

Code Consistency:
The Rust code uses a public constant RUNNER_ELIGIBLE_THRESHOLD_MS (10s) and private RUNNER_LOST_THRESHOLD_MS (15s). This 5-second gap seems intentional for hysteresis but should be documented.

Line 91 in runner.rs:

match ctx
    .listen_with_timeout::<Main>(RUNNER_LOST_THRESHOLD_MS)
    .await?

This timeout is used correctly in the workflow loop.

Recommendations

Document the rationale for choosing 15s in the PR description or code comments
Consider increasing to 30-45s for production safety unless there's a specific need for aggressive timeouts
Add metrics to track false-positive evictions vs. genuine failures
Coordinate with SRE/DevOps to ensure this change aligns with infrastructure reliability
Consider making this configurable per environment (dev vs. prod)

Verdict

The code changes are clean and well-structured, but the 15-second timeout seems risky for production use. I'd recommend either:

Increasing to 30-45s for better fault tolerance, or
Providing clear justification for the aggressive timeout

Otherwise, the implementation looks good! ✨

claude · 2025-11-04T00:42:58Z

PR Review: Decrease Runner Timeout to 15s

I've reviewed this PR and have several observations and recommendations:

Summary

This PR reduces the runner lost threshold from 2 minutes to 15 seconds, while the ping interval remains at 3 seconds. This change makes the system more responsive to lost runners but introduces some risks.

Code Quality & Best Practices

✅ Positive aspects:

Good documentation - The added comment explaining the ping interval (3s) helps clarify the relationship between ping frequency and timeout
Constant extraction - Moving the hardcoded 1000 to RUNNER_PING_INTERVAL constant improves maintainability
Code cleanup - Removing extraneous blank lines and improving formatting

⚠️ Areas for improvement:

1. Magic number ratio

The ratio between RUNNER_LOST_THRESHOLD_MS (15s) and RUNNER_PING_INTERVAL (3s) is 5:1. This means only 5 missed pings will cause a runner to be considered lost. Consider:

Making this ratio explicit and configurable
Adding a constant like RUNNER_LOST_PING_MULTIPLIER = 5
Or even better: RUNNER_LOST_THRESHOLD_MS = RUNNER_PING_INTERVAL * 5

This would make the relationship clearer and easier to adjust.

2. Cross-language coordination

The TypeScript SDK (mod.ts) defines RUNNER_PING_INTERVAL but the Rust workflow (runner.rs) hardcodes the relationship. If someone changes the TypeScript ping interval without updating the Rust threshold, the system could break. Consider:

Documenting this dependency more prominently in both files
Using a shared configuration mechanism if possible
Adding a build-time check or test to verify they stay in sync

Potential Bugs & Issues

1. Network instability tolerance ⚠️

With a 15-second timeout and 3-second pings, the system has very little tolerance for network hiccups:

5 missed pings = lost runner
Network jitter, brief connectivity issues, or CPU contention could easily cause false positives
Previously at 2 minutes, there was 40x ping buffer

Recommendation: Monitor production metrics closely after deploying this change. Consider:

Adding metrics to track how often runners are marked as lost
Implementing exponential backoff or jitter in the ping mechanism
Having a grace period or warning state before marking a runner as lost

2. Race conditions during network recovery

If a runner reconnects after being marked lost (but before actors are rescheduled), there's potential for:

Duplicate actor allocations
State inconsistencies
Actor thrashing (rapid stop/start cycles)

The code at runner.rs:246-283 handles draining runners, but with a tighter timeout, edge cases are more likely.

Performance Considerations

✅ Positive impact:

Faster detection of genuinely lost runners
Quicker actor rescheduling
Better system responsiveness

⚠️ Potential concerns:

Increased actor churn - More aggressive timeouts could lead to more frequent actor migrations
Database load - More frequent runner state changes means more database writes
Workflow overhead - More timeout events being processed in the workflow loop

Recommendation: Add observability metrics:

// Track how often timeouts actually fire
tracing::info!(runner_id=?input.runner_id, "runner lost threshold reached");

Security Concerns

No significant security issues identified. The change doesn't introduce new attack vectors.

Test Coverage

❌ Missing test coverage:

No tests found for runner timeout behavior
No integration tests for the ping/lost detection mechanism
No tests verifying the synchronization between TypeScript and Rust timeout values

Recommendation: Add tests for:

Runner marked as lost after missing N pings
Runner successfully reconnects before timeout
Actors properly rescheduled when runner lost
Edge case: runner reconnects during draining state

Example test structure:

#[tokio::test]
async fn test_runner_lost_after_timeout() {
    // Setup runner workflow
    // Stop sending pings
    // Wait RUNNER_LOST_THRESHOLD_MS + buffer
    // Assert runner marked as lost
    // Assert actors rescheduled
}

Additional Observations

1. Related code in `mod.ts:255`

The removed blank line is cosmetic but the reformatting of getActorHibernationConfig parameters looks like an unrelated formatting change. While not problematic, it would be cleaner to keep formatting changes separate from functional changes.

2. Consider making timeout configurable

Rather than hardcoding, consider making this a configuration parameter that can be adjusted per environment:

Development: longer timeout for debugging
Production: shorter timeout for responsiveness
Could be part of the RunnerConfig in TypeScript

Recommendations Summary

Before merging:

✅ Add explicit documentation about the 5:1 ratio between timeout and ping interval
✅ Add a constant expressing the multiplier relationship
⚠️ Consider adding basic test coverage for timeout behavior
⚠️ Add observability/metrics to track runner lost events

After merging:

📊 Monitor runner lost frequency in production
📊 Track actor rescheduling rates
📊 Watch for increased database write load
🔍 Be prepared to adjust the timeout if false positives occur

Conclusion

This is a reasonable change that makes the system more responsive, but it comes with tradeoffs. The decreased timeout window means less tolerance for transient network issues. The code quality is good, but test coverage is lacking. I recommend:

Merge with caution - This is deployable but needs monitoring
Add metrics - Track the actual impact in production
Follow up - Add tests and consider making timeouts configurable
Document - Add comments explaining the timeout/ping relationship

Overall assessment: ✅ Approve with recommendations for follow-up work.

Review generated with assistance from Claude Code

claude · 2025-11-04T19:01:29Z

PR Review: Decrease runner timeout to 15s

Summary

This PR reduces the RUNNER_LOST_THRESHOLD_MS from 2 minutes to 15 seconds and extracts the ping interval (3s) as a constant in the TypeScript SDK. The changes aim to make runner failure detection more responsive.

Code Quality & Best Practices

Positive:

✅ Good addition of documentation comment explaining the relationship between ping interval (3s) and timeout (15s)
✅ Proper constant extraction (RUNNER_PING_INTERVAL) improves code maintainability
✅ Whitespace cleanup is appropriate

Concerns:

⚠️ Inconsistent constant definition: The ping interval is defined in two places:
- TypeScript: RUNNER_PING_INTERVAL = 3_000 (line 12)
- Rust comment: "Runner ping interval is currently set to 3s" (line 19)
Consider extracting this to a shared protocol constant to maintain single source of truth.

Potential Issues

1. Timeout Safety Margin (Critical)

The current configuration has a 5x safety margin (15s timeout / 3s ping = 5 pings). However:

Network latency, processing delays, or temporary infrastructure issues could cause missed pings
With only 5 ping opportunities before timeout, transient network issues could cause unnecessary runner evictions
Recommendation: Consider a 30-45s timeout for a safer 10-15x margin, or document why 15s is sufficient for your infrastructure

2. Race Condition Risk

Reducing timeout from 2 minutes to 15 seconds significantly increases the risk of race conditions between:

Runner attempting to reconnect
Server marking runner as lost and evicting actors
Actor workflows receiving "Lost" signals

The code at runner.rs:324-334 handles timeout expiry, but there's no apparent grace period for reconnection attempts.

3. Impact on Actor Rescheduling

When a runner times out, all actors are marked as lost (runner.rs:357-382). With a 15s timeout:

More actors will be rescheduled unnecessarily due to transient issues
This could cause cascading effects if many runners timeout simultaneously
Consider the impact on your actor crash policies and rescheduling logic

Performance Considerations

Positive: Faster failure detection improves responsiveness
Concern: More aggressive timeout may increase control plane load due to:
- More frequent runner evictions and re-registrations
- More actor rescheduling operations
- More database transactions for cleanup operations

Security Concerns

No security issues identified.

Test Coverage

Missing:

❌ No tests verify the timeout behavior with the new value
❌ No tests validate the relationship between ping interval and timeout threshold
❌ No load tests to verify system behavior under transient network issues

Recommendations:

Add integration test that simulates runner missing 4 pings (should stay alive) vs 5+ pings (should timeout)
Add test verifying reconnection behavior within the timeout window
Consider chaos testing with artificial network delays

Suggested Improvements

// In runner.rs, consider making these configurable or better documented:
/// Runner ping interval in milliseconds. This is set in the TypeScript SDK.
const RUNNER_PING_INTERVAL_MS: i64 = util::duration::seconds(3);

/// How long to wait after last ping before forcibly removing a runner.
/// Set to 5x the ping interval to allow for some missed pings due to transient issues.
const RUNNER_LOST_THRESHOLD_MS: i64 = RUNNER_PING_INTERVAL_MS * 5;

Questions for the Author

What operational experience led to reducing this from 2 minutes to 15 seconds?
Have you tested this under production-like network conditions with packet loss/delays?
Is there monitoring in place to track false-positive runner evictions?
Are there any automatic reconnection mechanisms that would benefit from a longer timeout?

Verdict

⚠️ Request Changes: While the code changes are clean, the 15-second timeout seems aggressive without additional safeguards. Consider:

Increasing to 30-45 seconds, OR
Adding comprehensive tests demonstrating stability, OR
Implementing a grace period/reconnection mechanism, OR
Providing metrics from testing that validate 15s is safe for your infrastructure

The current 5x safety margin is at the lower bound of what's typically recommended for distributed systems with network communication.

Review generated with assistance from Claude Code

claude · 2025-11-04T19:26:30Z

PR Review: Decrease runner timeout to 15s

Summary

This PR reduces the RUNNER_LOST_THRESHOLD_MS from 2 minutes to 15 seconds, with a corresponding addition of documentation noting the 3-second ping interval. The TypeScript runner SDK also refactors the ping interval to use a named constant.

Code Quality & Best Practices

Positive:

✅ Good documentation added explaining the relationship between ping interval (3s) and timeout (15s)
✅ Magic number elimination in TypeScript by introducing RUNNER_PING_INTERVAL constant
✅ Consistent timeout-to-ping ratio (15s timeout / 3s ping = 5 missed pings before timeout)

Minor formatting improvements:

The TypeScript changes include some formatting updates (function signature alignment) that are reasonable but could have been in a separate commit for clearer history

Potential Issues & Concerns

🔴 Critical: Aggressive timeout may cause false positives

The 15-second timeout is very aggressive and could lead to false positive runner evictions in the following scenarios:

Network hiccups: Brief network interruptions (>15s) will now cause runner eviction
System load: If the runner process is temporarily CPU-starved or experiences GC pauses, it may miss multiple pings
WebSocket delays: WS connection issues that previously would have self-recovered within 2 minutes will now trigger eviction

Risk analysis:

Previous timeout: 2 minutes (40 missed pings)
New timeout: 15 seconds (5 missed pings)
This is a 92.5% reduction in tolerance

Recommendation: Consider a middle ground like 30-45 seconds (10-15 missed pings) to balance responsiveness with reliability.

⚠️ Moderate: Missing context on why this change is needed

The PR description is empty, making it difficult to understand:

What problem does this solve?
Has this been tested in production-like conditions?
Were there specific incidents that motivated this change?
What's the expected impact on runner stability?

⚠️ Moderate: Consider adding metrics/alerting

With such an aggressive timeout, you should consider:

Adding metrics to track runner eviction reasons (timeout vs. graceful shutdown)
Alerting on abnormally high eviction rates
Logging when runners are evicted due to timeout vs. other reasons

From the code at runner.rs:324-333, the timeout handler doesn't distinguish between network issues and actual runner failures.

Performance Considerations

Ping frequency impact:

Ping interval: 3 seconds
Ping overhead: Minimal (protocol message is small)
Network impact: Acceptable for most deployments

Database queries on timeout:

The check_expired activity (lines 939-964) runs a DB query when timeout occurs
With 15s timeout, this could increase DB load if many runners have connection issues
Consider: Does RUNNER_ELIGIBLE_THRESHOLD_MS (10s) need adjustment too? It's close to the lost threshold now

Security Considerations

No significant security concerns. The change makes the system more aggressive in evicting potentially-dead runners, which could be viewed as improving resource management but also as increasing the attack surface for DoS via connection disruption.

Test Coverage

Missing:

No test changes included
Consider adding tests for:
- Runner eviction at exactly 15s timeout
- Actor rescheduling behavior when runner times out
- Recovery behavior if runner reconnects after eviction

Existing coverage:
The workflow at runner.rs:91 uses listen_with_timeout::<Main>(RUNNER_LOST_THRESHOLD_MS) which should respect the new timeout, but edge cases should be tested.

Additional Observations

Rust side (runner.rs:20):

/// Runner ping interval is currently set to 3s.
const RUNNER_LOST_THRESHOLD_MS: i64 = util::duration::seconds(15);

TypeScript side (mod.ts:532):

}, RUNNER_PING_INTERVAL);

✅ The constant is properly synchronized between both sides.

Consider:

Should RUNNER_PING_INTERVAL be configurable via RunnerConfig?
Should the timeout be configurable per deployment (dev vs. prod)?
Document the timeout in the protocol metadata (it's already sent at line 119: runner_lost_threshold: RUNNER_LOST_THRESHOLD_MS)

Recommendations

Before merging:

🔴 Test in staging environment with representative network conditions
🔴 Add PR description explaining the motivation and expected impact
⚠️ Consider increasing timeout to 30-45s for safety
⚠️ Add metrics/logging to track eviction rates
⚠️ Document rollback plan if this causes issues in production

Future improvements:

Consider adaptive timeouts based on historical ping latency
Add health checks beyond just ping/pong
Consider exponential backoff for reconnection attempts

Verdict

⚠️ Approve with concerns - The code changes are clean and well-structured, but the 15-second timeout is very aggressive and needs careful production validation. Please address the recommendations above, particularly around testing and monitoring.

claude · 2025-11-04T19:32:35Z

PR Review: Decrease Runner Timeout to 15s

Overview

This PR reduces the runner lost threshold from 2 minutes to 15 seconds, aligning it more closely with the 3-second ping interval. The changes span both the Rust backend (Pegboard) and TypeScript SDK.

Code Quality & Best Practices

✅ Positive Observations

Good Documentation: The addition of the comment in runner.rs:19 explaining that "Runner ping interval is currently set to 3s" provides helpful context for why 15s is an appropriate threshold.
Consistency: The constant RUNNER_PING_INTERVAL was properly extracted in the TypeScript code (line 12), replacing the hardcoded 1000 value. This improves maintainability.
Clean Formatting Fix: Removed the unnecessary blank line at mod.ts:258 and properly formatted the getActorHibernationConfig function signature.
Follows Conventions: The commit message follows the conventional commit format specified in CLAUDE.md: chore(runner): decrease timeout to 15s.

Potential Issues & Concerns

⚠️ Critical: Timeout Calculation Concern

The new 15-second threshold creates a tight margin:

Ping interval: 3 seconds
Lost threshold: 15 seconds
Theoretical pings before timeout: 5 pings

Concerns:

Network Reliability: In production environments with network jitter, packet loss, or temporary connectivity issues, 5 missed pings (15s) might be too aggressive. This could lead to false positives where healthy runners are marked as lost.
No Buffer for System Load: If the runner process is under heavy load (GC pause, CPU contention), it might miss several ping intervals, causing premature eviction.
Consider the Math:
- Old threshold: 120s / 3s = ~40 pings of tolerance
- New threshold: 15s / 3s = 5 pings of tolerance
This is a 8x reduction in tolerance, which is quite aggressive.

Recommendation: Consider increasing to 30-45 seconds (10-15 pings) to balance responsiveness with stability. Alternatively, add telemetry to monitor false positives before deploying this change.

⚠️ Missing Test Coverage

The PR doesn't include any tests for the new timeout behavior. Consider adding:

Tests to verify the 15-second threshold is correctly enforced
Tests to ensure runners aren't prematurely evicted
Integration tests for the ping mechanism with the new timing

According to CLAUDE.md's testing guidelines, you should pipe test output to /tmp/ and grep for verification.

📝 Documentation Considerations

Constant Synchronization Risk: The ping interval (3s) is defined in TypeScript, but the timeout (15s) is in Rust. If someone changes the TypeScript ping interval without updating the Rust constant, the system could break. Consider:
- Adding a comment in both files cross-referencing each other
- Better yet, having the server communicate the timeout to the client during initialization (I see this is already done via ProtocolMetadata at line 119, which is good!)
Missing Rationale: The PR body is empty. It would be helpful to understand:
- Why was this change needed?
- What problem does it solve?
- Were there production issues with the 2-minute timeout?

Performance Considerations

✅ Positive Impact:

Faster detection of lost runners means actors can be rescheduled more quickly, improving system resilience.
Less resource waste from zombie runners.

⚠️ Potential Negative Impact:

More aggressive timeouts could increase churn and cause unnecessary actor migrations if the threshold is too tight.

Security Concerns

✅ No security issues identified. The change is purely operational timing.

Architecture Alignment

Looking at the code at runner.rs:119, I can see that RUNNER_LOST_THRESHOLD_MS is communicated to the client via ProtocolMetadata. This is good architecture - the client can adapt to server-side timeout changes. However, the TypeScript client should ideally use this value to adjust its ping interval dynamically rather than having it hardcoded.

Recommendations

Consider a more conservative timeout (30-45s) or add metrics/monitoring before rolling out to production
Add test coverage for the new timeout behavior
Add telemetry to track how often the timeout is triggered to validate the new threshold
Document the rationale in the PR description
Consider making the ping interval and timeout configurable rather than hardcoded constants

Summary

The code changes are clean and well-structured, but the 15-second timeout appears aggressive for production use. The 8x reduction in tolerance (from 120s to 15s) significantly increases the risk of false positives. I recommend gathering data on actual ping latencies in production before merging, or starting with a more conservative value like 30-45 seconds.

Verdict: ⚠️ Request Changes - Consider a less aggressive timeout or add supporting data/metrics to validate this threshold.

Review generated with context from CLAUDE.md and codebase conventions.

graphite-app · 2025-11-04T20:09:58Z

Merge activity

Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
Nov 4, 8:52 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Nov 4, 8:53 PM UTC: CI is running for this pull request on a draft pull request (#3353) due to your merge queue CI optimization settings.
Nov 4, 8:56 PM UTC: Merged by the Graphite merge queue via draft PR: #3353.

claude · 2025-11-04T20:26:34Z

PR Review: chore(runner): decrease timeout to 15s

Summary

This PR reduces the runner lost threshold from 2 minutes to 15 seconds and consolidates the ping interval constant to a single named constant (RUNNER_PING_INTERVAL). The change makes the system more responsive to runner failures.

Code Quality & Best Practices

✅ Good:

Added helpful documentation comment explaining the relationship between ping interval (3s) and timeout (15s)
Consolidated the ping interval into a named constant (RUNNER_PING_INTERVAL = 3_000), improving maintainability
Follows existing patterns in the codebase for timeout constants
Removed unnecessary blank lines (formatting cleanup)

⚠️ Considerations:

Timeout/Ping Ratio: The new 15s timeout with a 3s ping interval gives you a 5x multiplier (15s / 3s = 5 pings). This is quite aggressive compared to the previous 40x multiplier (120s / 3s = 40 pings).
- Risk: If there's any network hiccup causing 5 consecutive ping failures, runners will be marked as lost
- Recommendation: Consider if 5x is sufficient buffer for network reliability in production. Common practice is 3-5x for fast fail-over or 10-15x for stability. You might want to track metrics on false positives after deployment.
TypeScript Formatting Change: The PR includes a formatting change to getActorHibernationConfig parameters that appears unrelated to the timeout change. While this is fine, it's worth noting this was likely from an auto-formatter run.

Performance Considerations

✅ Positive Impact:

Faster detection of lost runners (15s vs 2m) means:
- Actors can be rescheduled more quickly
- Reduced resource waste on "zombie" runners
- Better user experience with faster recovery

⚠️ Potential Impact:

More frequent runner churn if network is unstable
Increased database operations from more frequent runner state transitions
More actor rescheduling operations if false positives occur

Recommendation: Monitor these metrics post-deployment:

Runner expiration rate
Actor rescheduling frequency
Average runner lifetime
False positive runner expirations (runners that reconnect after being marked lost)

Security Concerns

✅ No security concerns identified. The change is purely operational/timing related.

Test Coverage

⚠️ Missing:

No tests included in this PR
Consider adding a test case that verifies timeout behavior:
- Runner marked as lost after RUNNER_LOST_THRESHOLD_MS without ping
- Runner NOT marked as lost if pings arrive within threshold
- Integration test for the 15s timeout in the workflow loop (line 91: listen_with_timeout::<Main>(RUNNER_LOST_THRESHOLD_MS))

Additional Notes

Constant Naming Consistency: The Rust constant is RUNNER_LOST_THRESHOLD_MS while the TypeScript constant is RUNNER_PING_INTERVAL. Consider if RUNNER_PING_INTERVAL_MS would be more consistent (though the _MS suffix is less necessary for the interval constant).
Protocol Metadata: The timeout value is correctly communicated to clients via protocol::ProtocolMetadata (line 119), so clients will be aware of the new threshold.
Related Constants: The RUNNER_ELIGIBLE_THRESHOLD_MS = 10s (line 15) remains unchanged. This means:
- After 10s without ping: ineligible for new allocations
- After 15s without ping: marked as lost and removed
- This 5s gap seems reasonable for graceful degradation.

Overall Assessment

Verdict: ✅ Approve with monitoring recommendation

The code changes are clean and well-implemented. The aggressive timeout reduction is a significant operational change that should be monitored closely in production. The 5x safety margin is on the lower end but may be appropriate for your use case if low-latency recovery is prioritized over stability.

Action Items:

✅ Code looks good - approve
⚠️ Add monitoring/alerts for runner expiration rates post-deployment
💡 Consider adding integration tests for timeout behavior
💡 Track false positive metrics to validate 15s is the right threshold

🤖 Generated with Claude Code

This was referenced Nov 3, 2025

fix(rivetkit): ack ws messages #3329

Closed

chore(rivetkit): correctly stringify engine runner request ids for logs #3330

Closed

chore(rivetkit): fmt #3331

Closed

vercel bot deployed to Preview – rivetkit-serverless November 3, 2025 22:22 View deployment

NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from a0d58a1 to 9fbbebe Compare November 4, 2025 00:36

NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from 7895f28 to 546e885 Compare November 4, 2025 00:36

NathanFlurry requested a review from MasterPtato November 4, 2025 00:37

This was referenced Nov 4, 2025

fix(rivetkit): fix not pino bindings #3334

Closed

chore(rivetkit): add debug logging around sleep condition #3336

Closed

NathanFlurry mentioned this pull request Nov 4, 2025

chore(engine): move smoke test to engine/tests/ #3337

Closed

vercel bot deployed to Preview – rivetkit-serverless November 4, 2025 00:38 View deployment

MasterPtato approved these changes Nov 4, 2025

View reviewed changes

This was referenced Nov 4, 2025

fix(guard): include method in cache key #3338

Closed

chore(guard): simplify gateway routing #3339

Closed

chore(guard): add support for routing runner ws to /runners/connect #3340

Closed

chore(guard): handle cors on the gateway #3341

Closed

NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from 9fbbebe to d9e0eb8 Compare November 4, 2025 18:59

vercel bot deployed to Preview – rivetkit-serverless November 4, 2025 19:00 View deployment

NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from 0423a08 to dcca850 Compare November 4, 2025 19:24

NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from d9e0eb8 to d2735a8 Compare November 4, 2025 19:24

vercel bot deployed to Preview – rivetkit-serverless November 4, 2025 19:25 View deployment

This was referenced Nov 4, 2025

chore(rivetkit): debounce ws message ack #3345

Closed

chore(rivetkit): auto-update auto-start engine version #3344

Closed

ci: trigger rust ci for generated bare artifacts #3346

Closed

graphite-app bot pushed a commit that referenced this pull request Nov 4, 2025

chore(runner): decrease timeout to 15s (#3328)

4a5d468

chore(runner): decrease timeout to 15s

a114914

NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from dcca850 to e4d3b77 Compare November 4, 2025 20:25

NathanFlurry force-pushed the 11-03-chore_runner_decrease_timeout_to_15s branch from d2735a8 to a114914 Compare November 4, 2025 20:25

vercel bot deployed to Preview – rivetkit-serverless November 4, 2025 20:26 View deployment

graphite-app bot pushed a commit that referenced this pull request Nov 4, 2025

chore(runner): decrease timeout to 15s (#3328)

60fc116

graphite-app bot closed this Nov 4, 2025

graphite-app bot deleted the 11-03-chore_runner_decrease_timeout_to_15s branch November 4, 2025 20:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

chore(runner): decrease timeout to 15s #3328

chore(runner): decrease timeout to 15s #3328

Uh oh!

NathanFlurry commented Nov 3, 2025

Uh oh!

vercel bot commented Nov 3, 2025 •

edited

Loading

Uh oh!

NathanFlurry commented Nov 3, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 3, 2025

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

graphite-app bot commented Nov 4, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chore(runner): decrease timeout to 15s #3328

chore(runner): decrease timeout to 15s #3328

Uh oh!

Conversation

NathanFlurry commented Nov 3, 2025

Uh oh!

vercel bot commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NathanFlurry commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Nov 3, 2025

Code Review: Decrease Runner Lost Timeout to 15s

Summary

Positive Observations

Critical Concerns

1. Aggressive Timeout Ratio (HIGH PRIORITY)

2. Impact Analysis Missing

3. Relationship with RUNNER_ELIGIBLE_THRESHOLD

4. Potential Race Conditions

5. Actor Eviction Impact

6. Missing Test Coverage

Code Quality

Security Considerations

Performance Considerations

Recommendations

Questions for the Author

Verdict

Uh oh!

claude bot commented Nov 4, 2025

PR Review: Decrease runner timeout to 15s

Summary

Code Quality ✅

Potential Issues ⚠️

Performance Considerations 🔄

Security Concerns ✅

Test Coverage ❓

Additional Observations

Recommendations

Verdict

Uh oh!

claude bot commented Nov 4, 2025

PR Review: Decrease Runner Timeout to 15s

Summary

Code Quality & Best Practices

1. Magic number ratio

2. Cross-language coordination

Potential Bugs & Issues

1. Network instability tolerance ⚠️

2. Race conditions during network recovery

Performance Considerations

Security Concerns

Test Coverage

Additional Observations

1. Related code in mod.ts:255

2. Consider making timeout configurable

Recommendations Summary

Conclusion

Uh oh!

claude bot commented Nov 4, 2025

PR Review: Decrease runner timeout to 15s

Summary

Code Quality & Best Practices

Potential Issues

1. Timeout Safety Margin (Critical)

2. Race Condition Risk

3. Impact on Actor Rescheduling

Performance Considerations

Security Concerns

Test Coverage

Suggested Improvements

Questions for the Author

Verdict

Uh oh!

claude bot commented Nov 4, 2025

PR Review: Decrease runner timeout to 15s

Summary

Code Quality & Best Practices

vercel bot commented Nov 3, 2025 •

edited

Loading

NathanFlurry commented Nov 3, 2025 •

edited

Loading

1. Related code in `mod.ts:255`

graphite-app bot commented Nov 4, 2025 •

edited

Loading