-
Notifications
You must be signed in to change notification settings - Fork 130
chore(runner): decrease timeout to 15s #3328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(runner): decrease timeout to 15s #3328
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
Code Review: Decrease Runner Lost Timeout to 15sSummaryThis PR reduces the Positive Observations✅ Documentation Added: Good addition of the comment noting the ping interval is 3s Critical Concerns1. Aggressive Timeout Ratio (HIGH PRIORITY)The new timeout is only 5x the ping interval (15s timeout / 3s ping). This is quite aggressive and could lead to false positives in production:
Recommendation: Consider a more conservative ratio. Industry best practices typically use 3-5 missed heartbeats, but with more buffer. A timeout of 30-45 seconds (10-15x ping interval) would be more robust. 2. Impact Analysis MissingThe PR doesn't explain:
Question: What motivated this change? Is there a specific issue with runners staying in the system too long? 3. Relationship with RUNNER_ELIGIBLE_THRESHOLDThe code has two thresholds:
The window between "ineligible" and "lost" is now only 5 seconds. This seems tight. Question: Is 5 seconds enough time for the system to react to an ineligible runner before forcibly evicting it? 4. Potential Race ConditionsLooking at
With a tight 15s timeout, these factors become more significant. 5. Actor Eviction ImpactPer the comment at
This is a destructive operation. If actors are evicted due to a false positive, it could:
Recommendation: Consider adding metrics to track how often runners are marked as lost, and monitor after deploying this change. 6. Missing Test CoverageI don't see any test changes validating:
Recommendation: Add integration tests for the timeout behavior. Code QualityTypeScript Changes (
Rust Changes (
Security ConsiderationsNo direct security issues, but the aggressive timeout could be exploited:
Performance Considerations
Recommendations
Questions for the Author
VerdictReview generated following CLAUDE.md conventions |
a0d58a1 to
9fbbebe
Compare
7895f28 to
546e885
Compare
PR Review: Decrease runner timeout to 15sSummaryThis PR reduces the runner lost threshold from 2 minutes to 15 seconds and introduces a constant for the ping interval (3s) in the TypeScript SDK. The change aims to detect lost runners more quickly. Code Quality ✅Good:
Minor Issues:
Potential Issues
|
PR Review: Decrease Runner Timeout to 15sI've reviewed this PR and have several observations and recommendations: SummaryThis PR reduces the runner lost threshold from 2 minutes to 15 seconds, while the ping interval remains at 3 seconds. This change makes the system more responsive to lost runners but introduces some risks. Code Quality & Best Practices✅ Positive aspects:
1. Magic number ratioThe ratio between
This would make the relationship clearer and easier to adjust. 2. Cross-language coordinationThe TypeScript SDK (
Potential Bugs & Issues1. Network instability tolerance
|
9fbbebe to
d9e0eb8
Compare
PR Review: Decrease runner timeout to 15sSummaryThis PR reduces the Code Quality & Best PracticesPositive:
Concerns:
Potential Issues1. Timeout Safety Margin (Critical)The current configuration has a 5x safety margin (15s timeout / 3s ping = 5 pings). However:
2. Race Condition RiskReducing timeout from 2 minutes to 15 seconds significantly increases the risk of race conditions between:
The code at 3. Impact on Actor ReschedulingWhen a runner times out, all actors are marked as lost (
Performance Considerations
Security ConcernsNo security issues identified. Test CoverageMissing:
Recommendations:
Suggested Improvements// In runner.rs, consider making these configurable or better documented:
/// Runner ping interval in milliseconds. This is set in the TypeScript SDK.
const RUNNER_PING_INTERVAL_MS: i64 = util::duration::seconds(3);
/// How long to wait after last ping before forcibly removing a runner.
/// Set to 5x the ping interval to allow for some missed pings due to transient issues.
const RUNNER_LOST_THRESHOLD_MS: i64 = RUNNER_PING_INTERVAL_MS * 5;Questions for the Author
Verdict
The current 5x safety margin is at the lower bound of what's typically recommended for distributed systems with network communication. Review generated with assistance from Claude Code |
0423a08 to
dcca850
Compare
d9e0eb8 to
d2735a8
Compare
PR Review: Decrease runner timeout to 15sSummaryThis PR reduces the Code Quality & Best PracticesPositive:
Minor formatting improvements:
Potential Issues & Concerns🔴 Critical: Aggressive timeout may cause false positives The 15-second timeout is very aggressive and could lead to false positive runner evictions in the following scenarios:
Risk analysis:
Recommendation: Consider a middle ground like 30-45 seconds (10-15 missed pings) to balance responsiveness with reliability. The PR description is empty, making it difficult to understand:
With such an aggressive timeout, you should consider:
From the code at Performance ConsiderationsPing frequency impact:
Database queries on timeout:
Security ConsiderationsNo significant security concerns. The change makes the system more aggressive in evicting potentially-dead runners, which could be viewed as improving resource management but also as increasing the attack surface for DoS via connection disruption. Test CoverageMissing:
Existing coverage: Additional ObservationsRust side (runner.rs:20): /// Runner ping interval is currently set to 3s.
const RUNNER_LOST_THRESHOLD_MS: i64 = util::duration::seconds(15);TypeScript side (mod.ts:532): }, RUNNER_PING_INTERVAL);✅ The constant is properly synchronized between both sides. Consider:
RecommendationsBefore merging:
Future improvements:
Verdict |
PR Review: Decrease Runner Timeout to 15sOverviewThis PR reduces the runner lost threshold from 2 minutes to 15 seconds, aligning it more closely with the 3-second ping interval. The changes span both the Rust backend (Pegboard) and TypeScript SDK. Code Quality & Best Practices✅ Positive Observations
Potential Issues & Concerns
|
Merge activity
|
dcca850 to
e4d3b77
Compare
d2735a8 to
a114914
Compare
PR Review: chore(runner): decrease timeout to 15sSummaryThis PR reduces the runner lost threshold from 2 minutes to 15 seconds and consolidates the ping interval constant to a single named constant ( Code Quality & Best Practices✅ Good:
Performance Considerations✅ Positive Impact:
Recommendation: Monitor these metrics post-deployment:
Security Concerns✅ No security concerns identified. The change is purely operational/timing related. Test Coverage
Additional Notes
Overall AssessmentVerdict: ✅ Approve with monitoring recommendation The code changes are clean and well-implemented. The aggressive timeout reduction is a significant operational change that should be monitored closely in production. The 5x safety margin is on the lower end but may be appropriate for your use case if low-latency recovery is prioritized over stability. Action Items:
🤖 Generated with Claude Code |

No description provided.