Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 4, 2025 8:26pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 4, 2025 8:26pm
rivet-inspector Ignored Ignored Preview Nov 4, 2025 8:26pm
rivet-site Ignored Ignored Preview Nov 4, 2025 8:26pm

Copy link
Member Author

NathanFlurry commented Nov 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 3, 2025

Code Review

Summary

This PR correctly fixes HTTP requests not returning 503 when an actor is removed by changing the status code from 404 to 503 and adding the required x-rivet-error header. This ensures Guard's retry logic properly handles these cases.

Positive Feedback

Correct Implementation: The change properly implements the Guard retry protocol:

  • Returns 503 status code (SERVICE_UNAVAILABLE)
  • Includes x-rivet-error header as required by should_retry_request_inner in guard-core/src/proxy_service.rs:2458-2459
  • Error code runner.actor_not_found follows the group.code naming convention

Good Documentation: The inline comment with GitHub link provides excellent context for future maintainers

Consistent Pattern: Matches the retry documentation in engine/contrib-docs/GUARD.md

Issues & Recommendations

1. Inconsistency with WebSocket Handling (Minor)

The WebSocket path handles the same scenario differently at tunnel.ts:466-481:

  • WebSocket: sends close with retry: false (line 478)
  • HTTP: returns 503 to trigger retry

Question: Should WebSockets also retry when actors are not found? If yes, consider setting retry: true for consistency. If no, please document why HTTP and WebSocket diverge in their retry behavior.

2. Missing Test Coverage (Moderate)

This is a critical retry path but appears to lack test coverage:

  • No tests verify the 503 + x-rivet-error header combination
  • No tests verify Guard actually retries on this response

Recommendation: Add a test case to verify:

// Pseudo-code
test('returns 503 with retry header when actor not found', async () => {
  const response = await tunnel.fetch(unknownActorId, request);
  expect(response.status).toBe(503);
  expect(response.headers.get('x-rivet-error')).toBe('runner.actor_not_found');
});

3. Error Code Validation (Minor)

The error code runner.actor_not_found is not validated against Guard's expected error codes. While the header presence triggers retry, consider documenting the expected error code format or validating it matches Guard's expectations.

4. Response Body Clarity (Minor)

The response body "Actor not found" might be confusing since this is a retryable 503, not a final "not found" error. Consider a body like "Actor temporarily unavailable" to better reflect the transient nature.

Security Considerations

✅ No security concerns identified. The change appropriately handles missing actors without exposing internal state.

Performance Considerations

✅ No performance concerns. The change is a simple status code + header modification with negligible overhead.

Code Quality

✅ Follows TypeScript best practices
✅ Follows project conventions from CLAUDE.md
✅ Clear and informative comment

Overall Assessment

APPROVED with minor recommendations. The core fix is correct and well-implemented. The main suggestion is to add test coverage for this critical retry path to prevent future regressions.

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch from 65f206c to b372bb1 Compare November 4, 2025 00:36
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_dupe_onactorstop_call branch from b62cf5d to 4fb3477 Compare November 4, 2025 00:36
@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR correctly fixes HTTP requests not returning 503 when an actor is removed by changing the status code from 404 to 503 and adding the required x-rivet-error header. This ensures Guard's retry logic properly handles these cases.

Positive Feedback

Correct Implementation: The change properly implements the Guard retry protocol:

  • Returns 503 status code (SERVICE_UNAVAILABLE)
  • Includes x-rivet-error header as required by should_retry_request_inner in engine/packages/guard-core/src/proxy_service.rs:2458-2459
  • Error code runner.actor_not_found follows the group.code naming convention

Good Documentation: The inline comment with GitHub link provides excellent context for future maintainers

Consistent Pattern: Matches the retry mechanism used throughout the Guard proxy service

Issues & Recommendations

1. Inconsistency with WebSocket Handling (Minor)

The WebSocket path handles the same scenario differently - it sends a close with retry: false.

Question: Should WebSockets also retry when actors are not found? If yes, consider setting retry: true for consistency. If no, please document why HTTP and WebSocket diverge in their retry behavior.

2. Missing Test Coverage (Moderate Priority)

This is a critical retry path but appears to lack test coverage:

  • No tests verify the 503 + x-rivet-error header combination
  • No tests verify Guard actually retries on this response
  • The existing test file (engine/sdks/typescript/runner/tests/lifecycle.test.ts) is entirely commented out

Recommendation: Add a test case to verify the retry behavior works end-to-end, especially since this is part of a critical error handling path.

3. Error Code Validation (Minor)

The error code runner.actor_not_found is not validated against Guard's expected error codes. While the header presence triggers retry, consider documenting the expected error code format or ensuring it aligns with Guard's error handling patterns.

4. Response Body Clarity (Minor)

The response body "Actor not found" might be confusing since this is a retryable 503, not a final "not found" error. Consider a body like "Actor temporarily unavailable" to better reflect the transient nature of the error.

Security & Performance

✅ No security concerns identified
✅ No performance concerns - simple status code + header modification

Code Quality

✅ Follows TypeScript best practices
✅ Clear and informative comment with reference link
✅ Minimal, focused change

Overall Assessment

APPROVED with recommendations. The core fix is correct and well-implemented. The main suggestion is to add test coverage for this critical retry path to prevent future regressions.

@vercel
Copy link

vercel bot commented Nov 4, 2025

Deployment failed with the following error:

Too many requests - try again in 5 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR fixes HTTP requests to return 503 with retry logic when an actor is not found, instead of returning a 404. The change aligns with Guard's retry mechanism that checks for 503 + x-rivet-error header.

Positive Observations

Correct Fix: The change properly implements the retry protocol expected by Guard. The should_retry_request_inner function in guard-core/src/proxy_service.rs:2458-2460 specifically checks for:

  • Status code 503 (SERVICE_UNAVAILABLE)
  • Presence of x-rivet-error header

Good Documentation: The inline comment with the GitHub link to the Guard retry logic is excellent for maintainability.

Consistent Error Handling: Using runner.actor_not_found as the error code follows a reasonable naming convention (group.code pattern).

Issues & Concerns

1. Inconsistent WebSocket Behavior (Medium Priority)

The WebSocket handler has a similar "actor not found" case at tunnel.ts:467-481 that closes the connection with retry: false. Should this also be updated to support retries for consistency?

// Line 467-481: WebSocket handling
if (!actor) {
    logger()?.warn({
        msg: "ignoring websocket for unknown actor",
        actorId: open.actorId,
    });
    // Send close immediately
    this.#sendMessage(requestId, {
        tag: "ToServerWebSocketClose",
        val: {
            code: 1011,
            reason: "Actor not found",
            retry: false,  // Should this be true?
        },
    });
    return;
}

2. Error Code Registration (Low Priority)

The error code runner.actor_not_found doesn't appear to be formally registered in the error artifacts system. Other errors have JSON files in engine/artifacts/errors/ (e.g., guard.actor_not_found.json, actor.not_found.json). Consider whether this should be registered for consistency, though it may not be strictly necessary since it's only used as a header value.

3. Missing Test Coverage (Medium Priority)

No test appears to cover this specific scenario. Consider adding a test case to verify:

  • 503 status code is returned when actor is not found
  • x-rivet-error header is present with correct value
  • Guard properly retries the request

Minor Observations

  • The error message "Actor not found" in the response body is appropriate
  • The structured logging approach is consistent with the codebase patterns
  • No performance concerns with this change

Recommendation

Approve with suggested follow-ups:

  1. Consider updating WebSocket handling for consistency (can be done in a follow-up PR if needed)
  2. Add test coverage for this scenario
  3. Verify error code registration requirements

The core fix is correct and addresses the issue properly. The suggested improvements are for consistency and robustness but don't block this PR.

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_dupe_onactorstop_call branch from f89391d to cadd5cb Compare November 4, 2025 19:24
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch from 102e612 to bd0a1b7 Compare November 4, 2025 19:24
@claude
Copy link

claude bot commented Nov 4, 2025

Code Review - PR #3322

Summary

This PR fixes a critical bug where HTTP requests were returning a 404 status code when an actor was not found, instead of a 503 that would trigger Guard's retry logic. The change correctly implements the retry protocol by returning 503 with the x-rivet-error header.

Analysis

✅ Strengths

  1. Correct Implementation: The fix properly aligns with Guard's retry mechanism:

    • Returns 503 SERVICE_UNAVAILABLE status code
    • Includes x-rivet-error: runner.actor_not_found header
    • Matches the requirements in guard-core/src/proxy_service.rs:2458-2460
  2. Excellent Documentation: The inline comment with the GitHub link to the Guard retry logic (should_retry_request_inner) provides valuable context for future maintainers.

  3. Appropriate Error Handling: The runner.actor_not_found error code follows a consistent naming pattern.

  4. Minimal, Focused Change: The PR addresses exactly one issue without introducing scope creep.

🔍 Observations & Recommendations

1. WebSocket Inconsistency (Minor)

The WebSocket handler at tunnel.ts:467-481 handles the same "actor not found" scenario but sets retry: false. Consider whether WebSockets should also retry when actors are unavailable:

// Current WebSocket behavior
this.#sendMessage(requestId, {
    tag: "ToServerWebSocketClose",
    val: {
        code: 1011,
        reason: "Actor not found",
        retry: false,  // ← Should this match HTTP's retry behavior?
    },
});

If the different behavior is intentional, consider adding a comment explaining why HTTP retries but WebSockets don't.

2. Test Coverage Gap (Moderate Priority)

This is a critical error path but appears to lack test coverage. Consider adding tests to verify:

  • Response returns 503 status code
  • x-rivet-error header is present with correct value
  • Guard's retry logic is triggered by this response
  • The retry eventually succeeds when the actor becomes available

3. Response Body Message (Minor)

The response body "Actor not found" might be misleading since 503 indicates a temporary condition. Consider using "Actor temporarily unavailable" to better communicate that this is a retryable error rather than a permanent "not found" condition.

Security & Performance

No security concerns: The change appropriately handles missing actors without exposing internal state
No performance impact: Simple status code and header modification with negligible overhead

Code Quality

✅ Follows TypeScript conventions
✅ Consistent with project patterns from CLAUDE.md
✅ Clear, maintainable code with good documentation

Verdict

APPROVED

The core fix is correct and properly addresses the issue. The recommendations above are for future improvements and don't block this PR. This is a solid bug fix that will improve the reliability of actor request handling.

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Fix HTTP requests returning 503 when actor removed

Summary

This PR changes the response code from 404 Not Found to 503 Service Unavailable (with the x-rivet-error header) when an HTTP request is made to an actor that doesn't exist on the runner. This enables Guard's retry mechanism to attempt to find the actor on a different runner.

Code Quality: ✅ Excellent

Strengths:

  • Clear, detailed comment explaining the behavior and linking to the relevant Guard code
  • Follows the documented retry protocol pattern (see engine/contrib-docs/GUARD.md)
  • Minimal, focused change that addresses a specific issue
  • Properly uses the x-rivet-error header pattern that Guard expects

Best Practices: ✅ Good

Observations:

  • The error code runner.actor_not_found follows a sensible naming convention (group.code format)
  • The response message "Actor not found" is clear and descriptive
  • The change is consistent with Guard's retry mechanism expectations (should_retry_request_inner in proxy_service.rs:2458)

Potential Issues: ⚠️ Minor Concerns

1. Error Code Consistency
The error code runner.actor_not_found doesn't appear to have a corresponding error definition in the Rust error system. Guard has a guard.actor_not_found error (engine/packages/guard/src/errors.rs:44-50), but there's no runner.actor_not_found error defined. Consider:

  • Is this intentional? (The header value may not need a formal error type)
  • Should this align with existing error codes, or is the ad-hoc header value acceptable for this use case?

2. Semantic Correctness
When the runner doesn't have the actor, returning 503 Service Unavailable is technically correct for retry purposes, but the semantic meaning could be debated:

  • 503 typically means "service temporarily unavailable"
  • In this case, the actor is unavailable on this runner, but may be available on another
  • The retry mechanism will re-resolve the route, which should find the correct runner
  • Verdict: This is the correct approach given Guard's retry architecture ✅

Performance: ✅ No Concerns

  • No performance impact
  • The change enables proper retry behavior which should improve overall system resilience

Security: ✅ No Concerns

  • No security implications identified
  • The error header doesn't leak sensitive information

Test Coverage: ⚠️ Needs Attention

Missing Test Coverage:
The test file engine/sdks/typescript/runner/tests/lifecycle.test.ts is entirely commented out. While this specific change is straightforward, the broader test suite should be re-enabled. Consider:

  1. Integration test scenario - Test that Guard properly retries when receiving this response:

    • Actor exists on runner B but not runner A
    • Request initially routed to runner A
    • Runner A returns 503 with x-rivet-error: runner.actor_not_found
    • Guard retries and successfully routes to runner B
  2. Unit test - Verify the response format when hasActor() returns false

Recommendations

High Priority:

  1. No blocker issues - This change is ready to merge from a functionality standpoint

Medium Priority:

  1. Document whether runner.actor_not_found should be formalized as an error type, or if ad-hoc header values are acceptable for retry signaling
  2. Consider adding a comment explaining why 503 (not 404) is semantically correct for this scenario (hint: it triggers Guard's route re-resolution)

Low Priority:

  1. Re-enable the test suite in lifecycle.test.ts (separate PR recommended)
  2. Add integration test coverage for the Guard retry flow with this scenario

Verdict: ✅ Approve

This is a well-implemented fix that correctly leverages Guard's retry mechanism. The change is minimal, well-documented, and follows established patterns. The semantic choice of 503 over 404 is correct for this distributed system architecture where an actor may exist on a different runner.

No blocking issues identified.


Generated by Claude Code 🤖

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
  • Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
  • Nov 4, 8:49 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:50 PM UTC: CI is running for this pull request on a draft pull request (#3352) due to your merge queue CI optimization settings.
  • Nov 4, 8:52 PM UTC: Merged by the Graphite merge queue via draft PR: #3352.

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch from bd0a1b7 to 6f5d613 Compare November 4, 2025 20:25
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_dupe_onactorstop_call branch from cadd5cb to 0f73c04 Compare November 4, 2025 20:25
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Fix HTTP requests not returning 503 when actor removed

Summary

This PR changes the response status code from 404 Not Found to 503 Service Unavailable when a request is made to an actor that doesn't exist on the runner, enabling Guard's retry mechanism to kick in and discover the actor's new location.


Code Quality & Best Practices ✅

Strengths:

  • Excellent documentation with inline comments explaining the retry mechanism
  • GitHub link to the relevant retry logic (should_retry_request_inner)
  • Follows the established pattern documented in GUARD.md for retry-triggering responses
  • Clear intent: changing from non-retryable (404) to retryable (503 + header) response

Observations:

  • The error code uses "runner.actor_not_found" rather than the existing "guard.actor_not_found" error defined in packages/guard/src/errors.rs:44. This appears intentional to distinguish between:

    • guard.actor_not_found - Guard can't find the actor in its routing logic
    • runner.actor_not_found - Runner received a request for an actor it doesn't have

    This distinction makes sense for debugging, but should be documented.


Correctness & Logic ✅

The fix correctly addresses the race condition where:

  1. Guard routes a request to a runner based on cached routing info
  2. Actor gets removed/moved between cache and request
  3. Runner receives request for non-existent actor
  4. Previously returned 404 (non-retryable) → User saw error
  5. Now returns 503 (retryable) → Guard bypasses cache and finds new location

The implementation matches the retry contract in guard-core/src/proxy_service.rs:2458-2459:

fn should_retry_request_inner(status: StatusCode, headers: &hyper::HeaderMap) -> bool {
    status == StatusCode::SERVICE_UNAVAILABLE && headers.contains_key(X_RIVET_ERROR)
}

Performance Considerations ✅

Positive impact:

  • Enables optimistic routing with graceful failure recovery
  • Avoids expensive proactive cache invalidation
  • Retry backoff (100ms, 200ms, 400ms) prevents thundering herd
  • Max 3 attempts prevents infinite retry loops

No concerns - this is exactly the behavior Guard's retry mechanism was designed for.


Security Concerns ✅

No security issues identified:

  • Doesn't expose sensitive information (actor ID already in request headers)
  • Retry mechanism has bounded attempts (max 3)
  • Error response is appropriate for the condition

Test Coverage ⚠️

Missing test coverage:

  • No test found covering this specific error path in runner/tests/lifecycle.test.ts or elsewhere
  • Consider adding a test case that:
    1. Makes a request to a non-existent actor
    2. Verifies the response is 503 with x-rivet-error header
    3. Confirms the header value is "runner.actor_not_found"

Recommendation: Add integration test to prevent regression, especially since this fixes a user-facing bug.


Suggestions for Improvement

  1. Document the error code: Add runner.actor_not_found to either:

    • A new error definition file for runner errors
    • Documentation explaining runner-specific error codes
  2. Add test coverage: See test coverage section above

  3. Consider WebSocket handling: Verify similar logic exists for WebSocket connections when actors are removed (appears handled separately in #closeActorWebSockets)


Verdict: ✅ APPROVE with minor suggestions

This is a well-reasoned fix that correctly implements Guard's retry mechanism. The code change is minimal, well-documented, and addresses a real race condition. The only improvement would be adding test coverage to prevent regression.

Impact: This fix significantly improves user experience by making actor routing resilient to cache staleness and actor lifecycle changes.

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Fix HTTP requests not returning 503 when actor removed

Summary

This PR changes the response when an HTTP request arrives for an actor that no longer exists on the runner, from a 404 Not Found to a 503 Service Unavailable with the x-rivet-error: runner.actor_not_found header. This enables Guard's retry mechanism to automatically attempt the request against a fresh routing lookup.


Positive Aspects

  1. Correct use of retry pattern: The change properly implements Guard's retry protocol (503 + x-rivet-error header) as documented in engine/contrib-docs/GUARD.md.

  2. Good documentation: The inline comment with a GitHub link to the relevant retry logic in proxy_service.rs:2458 is excellent for maintainability.

  3. Solves a real problem: When an actor is removed/destroyed, cached routes in Guard become stale. The 404 response prevented retries, meaning requests would fail until the cache naturally expired. This change enables the optimistic routing + cache invalidation pattern.

  4. Semantic correctness: 503 is more semantically accurate than 404 for this scenario - the actor resource may still exist (perhaps on a different runner), but this specific runner can't service the request.


Concerns & Recommendations

1. Missing Test Coverage

There are no tests covering this scenario. I'd recommend adding:

Integration test in engine/packages/guard-core/tests/proxy.rs:

  • Test that verifies when a runner returns 503 with x-rivet-error, Guard retries with cache invalidation
  • Similar to test_retry_functionality() but specifically testing the 503 + header pattern

Runner SDK test (if applicable):

  • Test in the TypeScript runner to verify the correct response format

2. Error Code Consistency

The error code runner.actor_not_found follows a different naming convention than Guard's errors:

  • Guard uses: guard.actor_not_found (defined in engine/packages/guard/src/errors.rs:44)
  • This PR uses: runner.actor_not_found

Question: Is this intentional to distinguish between "Guard can't find actor" vs "Runner doesn't have actor"? If so, this is good. Otherwise, consider consistency.

3. Retry Implications

Current behavior:

  • Initial request → cached route to Runner A
  • Runner A returns 503 (actor removed)
  • Guard retries with fresh lookup → finds actor on Runner B
  • Request succeeds

Edge case consideration:
What happens if the actor is truly destroyed (not just moved)? The retry will:

  1. Do fresh lookup (finds no actor anywhere)
  2. Eventually return 502 after max retries

This seems acceptable, but worth documenting the expected behavior when an actor is permanently destroyed vs temporarily unavailable.

4. WebSocket Handling

I notice WebSocket connections have their own close mechanism (tunnel.ts:203):
ws.__closeWithRetry(1000, "Actor stopped");

Is there a similar scenario for WebSockets that should also return 503? Or does the WebSocket protocol handle this differently? (Not blocking, just curious about consistency)


Code Quality Assessment

Style: Follows TypeScript conventions
Comments: Excellent inline documentation
Error handling: Appropriate response format
Performance: No performance concerns
Security: No security issues identified


Recommendation

Approve with suggestions. The core change is correct and valuable. Primary recommendation is to add test coverage for this retry scenario to prevent regressions.


Additional Questions for Author

  1. Should we add metrics/logging to track how often this retry path is taken? It could provide visibility into actor routing cache hit rates.
  2. Are there other similar scenarios in the runner that should also return 503 for retry-ability?

Generated with Claude Code

@graphite-app graphite-app bot closed this Nov 4, 2025
@graphite-app graphite-app bot deleted the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch November 4, 2025 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants