Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 6, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 6, 2025 5:37am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 6, 2025 5:37am
rivet-inspector Ignored Ignored Preview Nov 6, 2025 5:37am
rivet-site Ignored Ignored Preview Nov 6, 2025 5:37am

Copy link
Member Author

NathanFlurry commented Nov 6, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 6, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3381

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3381

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3381

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3381

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3381

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3381

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3381

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3381

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3381

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3381

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3381

commit: d93280c

@claude
Copy link

claude bot commented Nov 6, 2025

Code Review

Summary

This PR fixes WebSocket close code 1006 errors that occur when clients close WebSocket connections. The fix adds proper error handling for close frame send/flush operations and ensures a proper close frame is always sent, even when no explicit close frame is provided.


Positive Aspects

1. Proper Error Handling

  • The addition of error handling for send() and flush() operations is the right approach. These operations can legitimately fail when the client has already initiated closure, and treating this as non-fatal is correct.

2. Good Logging

  • Extensive debug logging at each step helps diagnose WebSocket closure issues. The structured logging follows the project conventions (?err syntax).
  • Log messages are lowercase as per CLAUDE.md conventions.

3. Graceful Degradation

  • The code continues execution even if send/flush fails, which is appropriate since the connection is closing anyway.

Issues and Concerns

1. Overly Verbose Logging (Minor)

The current implementation has 6 debug log statements in a 30-line block. Consider consolidating some of these logs. The success cases may not need separate logs.

2. Breaking Change in to_hyper_close() (Medium Severity)

Location: proxy_service.rs:2567-2573

The function signature suggests frame: Option where None means no close frame. However, the new implementation always returns a close frame with CloseCode::Normal and reason "ws.closed".

Concerns:

  • This changes the semantics of passing None - previously it meant "send no close frame", now it means "send a default close frame"
  • The hardcoded reason "ws.closed" is not very descriptive
  • This may affect other call sites of to_hyper_close()

Recommendation:

  • Check all call sites to ensure this behavior change is intentional
  • Consider a more descriptive close reason or make it configurable
  • Document why this change was necessary for fixing the 1006 error

3. Unrelated Script Changes (Low Severity)

Location: scripts/run/engine-rocksdb.sh:10-14

The addition of Pegboard timeout configurations appears unrelated to the WebSocket close fix. These extremely low timeout values (100ms) suggest this is for testing - ensure they do not accidentally get used in production.

Recommendation:

  • Consider documenting why these changes are needed or splitting into a separate commit

Missing Elements

1. Test Coverage

There are existing WebSocket tests in tests/websocket.rs, but no tests added for this specific scenario.

Recommendation: Add a test case that:

  • Simulates a client closing the connection first
  • Verifies the server handles it gracefully without 1006 errors
  • Validates proper close frame handling in race conditions

2. Documentation

The PR description is empty. Consider adding:

  • What causes the 1006 error
  • How this fix resolves it
  • Any behavioral changes (especially the to_hyper_close change)

Performance Considerations

No significant performance concerns. The added error handling has negligible overhead.


Security Considerations

No security concerns identified. The changes do not introduce new attack vectors or expose sensitive information.


Recommendations Summary

  1. Must Fix:

    • Verify the to_hyper_close(None) behavior change is intentional and does not break other code paths
  2. Should Fix:

    • Add test coverage for client-initiated close scenarios
    • Document or separate the unrelated script changes
  3. Nice to Have:

    • Reduce logging verbosity
    • Add PR description explaining the fix

Overall, this is a solid fix for a real issue. The error handling approach is correct, and the logging will help debug future issues. The main concern is ensuring the to_hyper_close() behavior change does not have unintended side effects.

@claude
Copy link

claude bot commented Nov 6, 2025

Code Review: WebSocket Close Frame Handling Fix

Summary

This PR addresses WebSocket close frame handling issues (1006 errors) when clients close connections. The changes improve error resilience by gracefully handling failures when sending close frames to already-closing connections.


✅ What's Working Well

  1. Proper Error Handling: Converting the ? operator to explicit match statements on send() and flush() operations is the correct approach. This prevents panics when the client has already initiated the close handshake.

  2. Good Logging: The added debug logs follow the project's structured logging conventions (lowercase messages, using ?err format). The logs provide clear insight into the close handshake process.

  3. Appropriate Log Level: Using tracing::debug! for these expected failure cases is correct - these aren't errors, they're normal race conditions in WebSocket closing.

  4. Default Close Frame: Adding a default CloseCode::Normal with reason "ws.closed" when frame is None is a good defensive practice and follows the WebSocket spec better than sending no close frame.


🔍 Observations & Minor Concerns

1. Script File Changes

The changes to scripts/run/engine-rocksdb.sh appear unrelated to the WebSocket fix:

+RIVET__PEGBOARD__RETRY_RESET_DURATION="100" \
+RIVET__PEGBOARD__BASE_RETRY_TIMEOUT="100" \
+RIVET__PEGBOARD__RESCHEDULE_BACKOFF_MAX_EXPONENT="1" \
+RIVET__PEGBOARD__RUNNER_ELIGIBLE_THRESHOLD="5000" \
+RIVET__PEGBOARD__RUNNER_LOST_THRESHOLD="7000" \

Question: Are these Pegboard-related configuration changes intended for this PR, or should they be in a separate commit? These look like they might be development/testing settings rather than production configuration.

Recommendation: If these are just for local development/testing, consider:

  • Adding a comment explaining they're dev settings
  • Or moving to a separate .env.example or local config file
  • Or splitting into a separate PR if they're meant to be committed

2. Error Context

While the error handling is correct, the error objects being swallowed could potentially contain useful diagnostic information. Consider if any metrics or telemetry should be recorded for these cases to track:

  • Frequency of client-initiated closes vs server-initiated closes
  • Connection close timing patterns
  • Potential issues with specific clients

That said, since these are debug logs, this may be sufficient for diagnosing issues in practice.

3. Close Frame Reason

The default close reason "ws.closed" is fine, but you might consider:

  • Using an empty string (which is valid per RFC 6455)
  • Or a more descriptive reason like "Normal closure" to match the CloseCode::Normal

Current approach is perfectly acceptable though.


🧪 Testing Considerations

Question: Have you tested these scenarios?

  1. Client-initiated close (the 1006 case this fixes)
  2. Server-initiated close (existing behavior)
  3. Simultaneous close from both sides (race condition)
  4. Network failures during close handshake

The existing test file engine/packages/guard-core/tests/simple_websocket_test2.rs might be a good place to add a regression test for the client-initiated close case if one doesn't exist.


🎯 Security & Performance

  • Security: ✅ No security concerns. Proper error handling reduces potential DoS vectors from malformed close handshakes.
  • Performance: ✅ Negligible impact. The added error handling has minimal overhead.

📝 Code Style Compliance

  • ✅ Follows CLAUDE.md logging patterns (lowercase messages, structured logging with ? format)
  • ✅ Uses proper Rust error handling patterns
  • ✅ Appropriate use of debug logging level

Final Verdict

This is a solid bug fix that addresses a real issue with WebSocket connection handling. The core changes to proxy_service.rs are well-implemented and improve robustness.

Approved with minor questions about the script file changes. Consider addressing the script configuration question before merging, but the WebSocket handling changes themselves look good to merge.

Suggested next steps:

  1. Clarify the intent of the engine-rocksdb.sh changes
  2. Consider adding a regression test for client-initiated close scenarios
  3. If this fixes a specific issue/ticket, reference it in the PR description

@claude
Copy link

claude bot commented Nov 6, 2025

Code Review

Summary

This PR addresses WebSocket close code 1006 errors (abnormal closure) when clients close connections. The fix converts fatal errors during the close handshake into warnings, allowing graceful degradation when the client has already closed.

Positive Aspects

  • Proper error handling with match statements instead of propagating errors
  • Good structured logging following CLAUDE.md guidelines
  • Sensible default close frame with CloseCode::Normal

Issues and Concerns

1. Unrelated Script Changes (Blocker)
The Pegboard configuration changes in scripts/run/engine-rocksdb.sh appear unrelated to the WebSocket fix. These should be in a separate PR or explained in the PR description.

2. Excessive Debug Logging (Minor)
6 debug statements for a routine operation seems excessive. Consider logging only on errors to reduce noise.

3. Missing Test Coverage (Important)
No test to prevent regression. Add a test in engine/packages/guard-core/tests/ that verifies graceful handling when client sends close frame first.

4. Race Condition Question
Is WEBSOCKET_CLOSE_LINGER (100ms) sufficient for all network conditions? Consider making it configurable.

5. Close Frame Reason (Nitpick)
The reason string "ws.closed" looks like an error code. Consider "Connection closed normally" or empty string.

Recommendations

Must Fix:

  • Remove unrelated Pegboard config changes or explain necessity

Should Fix:

  • Reduce debug logging to error cases only
  • Add test coverage for client-initiated close handshake
  • Add PR description explaining bug, root cause, and fix

Conclusion

The core fix is sound and addresses the immediate issue. With these changes, this will be a solid improvement to WebSocket reliability.

@NathanFlurry NathanFlurry force-pushed the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch from 4078059 to 04f4430 Compare November 6, 2025 02:16
@NathanFlurry NathanFlurry force-pushed the 11-05-chore_runner_add_child_logger_with_runner_id branch from 51912f8 to 4831d26 Compare November 6, 2025 02:16
@claude
Copy link

claude bot commented Nov 6, 2025

PR Review: Fix 1006 WebSocket Close Issue

Summary

This PR addresses WebSocket close code 1006 (abnormal closure) errors by gracefully handling cases where the client has already closed the connection before the server attempts to send its close frame.


Code Quality & Best Practices ✅

Strengths:

  1. Proper error handling: The change from propagating errors (?) to explicitly handling them with match is the correct approach for this scenario
  2. Good logging: Uses structured logging with ?err and ?close_frame following the project's tracing patterns
  3. Clear comments: Explains why failures are acceptable ("This may fail if client already sent close frame, which is normal")
  4. Follows conventions: Log messages are lowercase as per CLAUDE.md guidelines

Minor suggestions:

  • The new to_hyper_close behavior (lines 2567-2572) now always sends a close frame with CloseCode::Normal and reason "ws.closed" when frame is None. Previously it sent Message::Close(None). This is a semantic change that should be explicitly mentioned in the PR description.

Potential Bugs or Issues ⚠️

  1. Race condition handling: The fix correctly addresses the race condition where client initiates close handshake before server tries to send its close frame. However, there's a potential edge case: if send() succeeds but flush() fails, the close frame may be partially sent. The current implementation handles this gracefully by logging and continuing.

  2. Error swallowing: While intentional, swallowing errors here means legitimate issues (e.g., network failures, resource exhaustion) are only logged at debug level. Consider distinguishing between "connection already closing" errors vs other error types if possible.


Performance Considerations ✅

  • No performance regression: The changes add minimal overhead (just pattern matching on error cases)
  • The WEBSOCKET_CLOSE_LINGER (100ms) delay is still applied regardless of send/flush success, which is correct for graceful closure

Security Concerns 🔒

Low risk, but worth noting:

  1. The change to always provide a close reason ("ws.closed") instead of None exposes slightly more information about the server implementation. This is minimal and likely acceptable.
  2. The close reason is properly bounded to 123 bytes (line 2549), preventing protocol violations.

Test Coverage 🧪

Recommendation: Add a test case that simulates the client-initiated close scenario to verify the fix works as expected and prevent regressions. The existing test files (tests/websocket.rs, tests/simple_websocket.rs) should be checked to ensure this scenario is covered.


Script Changes ⚠️

The scripts/run/engine-rocksdb.sh changes appear unrelated to the WebSocket fix. These Pegboard configuration changes should either be moved to a separate PR or have their purpose explained in the PR description if they're related to testing this fix.


Overall Assessment

Verdict: ✅ Approve with minor suggestions

The core WebSocket fix is solid and addresses a real race condition issue. The error handling approach is appropriate for this use case.

Action items before merge:

  1. ⚠️ Clarify or separate the engine-rocksdb.sh changes
  2. 📝 Document the semantic change in to_hyper_close (always sending close frame vs None)
  3. 🧪 Consider adding a test case for client-initiated close scenarios

Great work on identifying and fixing this race condition! The 1006 errors should be significantly reduced.

@NathanFlurry NathanFlurry force-pushed the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch from 04f4430 to d93280c Compare November 6, 2025 05:36
@NathanFlurry NathanFlurry force-pushed the 11-05-chore_runner_add_child_logger_with_runner_id branch from 4831d26 to ff5ae6a Compare November 6, 2025 05:36
@claude
Copy link

claude bot commented Nov 6, 2025

Code Review

Overview

This PR addresses WebSocket close handling by making the close frame sending and flushing operations more resilient to failures when the client has already closed the connection. This is a good defensive programming practice for WebSocket error code 1006 (abnormal closure).

Strengths

  1. Good error handling improvement: Converting from .await? to match statements with explicit error handling prevents panics when the client closes the connection first.

  2. Appropriate logging: The debug-level logging follows the project conventions (lowercase messages, structured logging with ?err).

  3. Good comments: The inline comment explaining why failures are expected is helpful for future maintainers.

  4. Consistent pattern: The same error handling is applied to both send and flush operations.

Issues and Suggestions

1. Script file changes appear unrelated

The changes to scripts/run/engine-rocksdb.sh add Pegboard configuration values but are not mentioned in the PR title/description. These seem unrelated to the WebSocket fix.

Recommendation: Consider splitting these into a separate commit/PR, or add a description explaining why these changes are needed for testing the WebSocket fix.

2. Inconsistent error handling pattern in the file

Looking at lines 1918-1925, there is an error path that still uses .await? for the same operations, and similarly at lines 1962-1969. If the client closing the connection early is a normal scenario, these other code paths should probably use the same defensive error handling.

Recommendation: Apply the same error handling pattern to all WebSocket close operations in this file for consistency.

3. Changed behavior in to_hyper_close function

The None case now returns a close frame with code Normal and reason "ws.closed" instead of Message::Close(None). While this is probably fine, it changes the wire protocol.

Questions: Is there a specific reason for this change? Have you verified this does not affect any clients expecting Close(None)?

4. Potentially excessive logging

The new code adds 4 debug log lines for each close operation. In a high-traffic environment, this might be noisy.

Recommendation: Consider consolidating to 1-2 log lines, or only log failures.

5. Missing tests

There do not appear to be any tests in the guard-core package for WebSocket handling. This specific edge case (client closes first) would be valuable to test.

Recommendation: Consider adding integration tests for WebSocket close scenarios.

Performance and Security

  • Performance: No concerns. The change only affects the close path which is already low-frequency.
  • Security: No concerns. Proper close handling actually improves security by preventing potential resource leaks from panicked connections.

Code Quality Score: 7/10

The core fix is solid, but the PR would benefit from:

  1. Splitting unrelated script changes
  2. Applying consistent error handling throughout the file
  3. Reducing log verbosity
  4. Adding tests

Final Recommendation

Approve with minor changes requested. The fix addresses a real issue, but please consider the consistency and scope concerns mentioned above.

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 6, 2025

Merge activity

  • Nov 6, 5:47 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 6, 5:48 AM UTC: CI is running for this pull request on a draft pull request (#3395) due to your merge queue CI optimization settings.
  • Nov 6, 5:50 AM UTC: Merged by the Graphite merge queue via draft PR: #3395.

@graphite-app graphite-app bot closed this Nov 6, 2025
@graphite-app graphite-app bot deleted the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch November 6, 2025 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants