fix(guard): fix 1006 when client closes ws connection #3381

NathanFlurry · 2025-11-06T00:58:19Z

No description provided.

vercel · 2025-11-06T00:58:25Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Ready	Preview	Comment	Nov 6, 2025 5:37am

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 6, 2025 5:37am
rivet-inspector	Ignored	Preview	Nov 6, 2025 5:37am
rivet-site	Ignored	Preview	Nov 6, 2025 5:37am

NathanFlurry · 2025-11-06T00:58:33Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

pkg-pr-new · 2025-11-06T01:00:37Z

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3381

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3381

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3381

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3381

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3381

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3381

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3381

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3381

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3381

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3381

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3381

commit: d93280c

claude · 2025-11-06T01:01:32Z

Code Review

Summary

This PR fixes WebSocket close code 1006 errors that occur when clients close WebSocket connections. The fix adds proper error handling for close frame send/flush operations and ensures a proper close frame is always sent, even when no explicit close frame is provided.

Positive Aspects

1. Proper Error Handling

The addition of error handling for send() and flush() operations is the right approach. These operations can legitimately fail when the client has already initiated closure, and treating this as non-fatal is correct.

2. Good Logging

Extensive debug logging at each step helps diagnose WebSocket closure issues. The structured logging follows the project conventions (?err syntax).
Log messages are lowercase as per CLAUDE.md conventions.

3. Graceful Degradation

The code continues execution even if send/flush fails, which is appropriate since the connection is closing anyway.

Issues and Concerns

1. Overly Verbose Logging (Minor)

The current implementation has 6 debug log statements in a 30-line block. Consider consolidating some of these logs. The success cases may not need separate logs.

2. Breaking Change in to_hyper_close() (Medium Severity)

Location: proxy_service.rs:2567-2573

The function signature suggests frame: Option where None means no close frame. However, the new implementation always returns a close frame with CloseCode::Normal and reason "ws.closed".

Concerns:

This changes the semantics of passing None - previously it meant "send no close frame", now it means "send a default close frame"
The hardcoded reason "ws.closed" is not very descriptive
This may affect other call sites of to_hyper_close()

Recommendation:

Check all call sites to ensure this behavior change is intentional
Consider a more descriptive close reason or make it configurable
Document why this change was necessary for fixing the 1006 error

3. Unrelated Script Changes (Low Severity)

Location: scripts/run/engine-rocksdb.sh:10-14

The addition of Pegboard timeout configurations appears unrelated to the WebSocket close fix. These extremely low timeout values (100ms) suggest this is for testing - ensure they do not accidentally get used in production.

Recommendation:

Consider documenting why these changes are needed or splitting into a separate commit

Missing Elements

1. Test Coverage

There are existing WebSocket tests in tests/websocket.rs, but no tests added for this specific scenario.

Recommendation: Add a test case that:

Simulates a client closing the connection first
Verifies the server handles it gracefully without 1006 errors
Validates proper close frame handling in race conditions

2. Documentation

The PR description is empty. Consider adding:

What causes the 1006 error
How this fix resolves it
Any behavioral changes (especially the to_hyper_close change)

Performance Considerations

No significant performance concerns. The added error handling has negligible overhead.

Security Considerations

No security concerns identified. The changes do not introduce new attack vectors or expose sensitive information.

Recommendations Summary

Must Fix:
- Verify the to_hyper_close(None) behavior change is intentional and does not break other code paths
Should Fix:
- Add test coverage for client-initiated close scenarios
- Document or separate the unrelated script changes
Nice to Have:
- Reduce logging verbosity
- Add PR description explaining the fix

Overall, this is a solid fix for a real issue. The error handling approach is correct, and the logging will help debug future issues. The main concern is ensuring the to_hyper_close() behavior change does not have unintended side effects.

claude · 2025-11-06T01:03:19Z

Code Review: WebSocket Close Frame Handling Fix

Summary

This PR addresses WebSocket close frame handling issues (1006 errors) when clients close connections. The changes improve error resilience by gracefully handling failures when sending close frames to already-closing connections.

✅ What's Working Well

Proper Error Handling: Converting the ? operator to explicit match statements on send() and flush() operations is the correct approach. This prevents panics when the client has already initiated the close handshake.
Good Logging: The added debug logs follow the project's structured logging conventions (lowercase messages, using ?err format). The logs provide clear insight into the close handshake process.
Appropriate Log Level: Using tracing::debug! for these expected failure cases is correct - these aren't errors, they're normal race conditions in WebSocket closing.
Default Close Frame: Adding a default CloseCode::Normal with reason "ws.closed" when frame is None is a good defensive practice and follows the WebSocket spec better than sending no close frame.

🔍 Observations & Minor Concerns

1. Script File Changes

The changes to scripts/run/engine-rocksdb.sh appear unrelated to the WebSocket fix:

+RIVET__PEGBOARD__RETRY_RESET_DURATION="100" \
+RIVET__PEGBOARD__BASE_RETRY_TIMEOUT="100" \
+RIVET__PEGBOARD__RESCHEDULE_BACKOFF_MAX_EXPONENT="1" \
+RIVET__PEGBOARD__RUNNER_ELIGIBLE_THRESHOLD="5000" \
+RIVET__PEGBOARD__RUNNER_LOST_THRESHOLD="7000" \

Question: Are these Pegboard-related configuration changes intended for this PR, or should they be in a separate commit? These look like they might be development/testing settings rather than production configuration.

Recommendation: If these are just for local development/testing, consider:

Adding a comment explaining they're dev settings
Or moving to a separate .env.example or local config file
Or splitting into a separate PR if they're meant to be committed

2. Error Context

While the error handling is correct, the error objects being swallowed could potentially contain useful diagnostic information. Consider if any metrics or telemetry should be recorded for these cases to track:

Frequency of client-initiated closes vs server-initiated closes
Connection close timing patterns
Potential issues with specific clients

That said, since these are debug logs, this may be sufficient for diagnosing issues in practice.

3. Close Frame Reason

The default close reason "ws.closed" is fine, but you might consider:

Using an empty string (which is valid per RFC 6455)
Or a more descriptive reason like "Normal closure" to match the CloseCode::Normal

Current approach is perfectly acceptable though.

🧪 Testing Considerations

Question: Have you tested these scenarios?

Client-initiated close (the 1006 case this fixes)
Server-initiated close (existing behavior)
Simultaneous close from both sides (race condition)
Network failures during close handshake

The existing test file engine/packages/guard-core/tests/simple_websocket_test2.rs might be a good place to add a regression test for the client-initiated close case if one doesn't exist.

🎯 Security & Performance

Security: ✅ No security concerns. Proper error handling reduces potential DoS vectors from malformed close handshakes.
Performance: ✅ Negligible impact. The added error handling has minimal overhead.

📝 Code Style Compliance

✅ Follows CLAUDE.md logging patterns (lowercase messages, structured logging with ? format)
✅ Uses proper Rust error handling patterns
✅ Appropriate use of debug logging level

Final Verdict

This is a solid bug fix that addresses a real issue with WebSocket connection handling. The core changes to proxy_service.rs are well-implemented and improve robustness.

Approved with minor questions about the script file changes. Consider addressing the script configuration question before merging, but the WebSocket handling changes themselves look good to merge.

Suggested next steps:

Clarify the intent of the engine-rocksdb.sh changes
Consider adding a regression test for client-initiated close scenarios
If this fixes a specific issue/ticket, reference it in the PR description

claude · 2025-11-06T01:04:12Z

Code Review

Summary

This PR addresses WebSocket close code 1006 errors (abnormal closure) when clients close connections. The fix converts fatal errors during the close handshake into warnings, allowing graceful degradation when the client has already closed.

Positive Aspects

Proper error handling with match statements instead of propagating errors
Good structured logging following CLAUDE.md guidelines
Sensible default close frame with CloseCode::Normal

Issues and Concerns

1. Unrelated Script Changes (Blocker)
The Pegboard configuration changes in scripts/run/engine-rocksdb.sh appear unrelated to the WebSocket fix. These should be in a separate PR or explained in the PR description.

2. Excessive Debug Logging (Minor)
6 debug statements for a routine operation seems excessive. Consider logging only on errors to reduce noise.

3. Missing Test Coverage (Important)
No test to prevent regression. Add a test in engine/packages/guard-core/tests/ that verifies graceful handling when client sends close frame first.

4. Race Condition Question
Is WEBSOCKET_CLOSE_LINGER (100ms) sufficient for all network conditions? Consider making it configurable.

5. Close Frame Reason (Nitpick)
The reason string "ws.closed" looks like an error code. Consider "Connection closed normally" or empty string.

Recommendations

Must Fix:

Remove unrelated Pegboard config changes or explain necessity

Should Fix:

Reduce debug logging to error cases only
Add test coverage for client-initiated close handshake
Add PR description explaining bug, root cause, and fix

Conclusion

The core fix is sound and addresses the immediate issue. With these changes, this will be a solid improvement to WebSocket reliability.

claude · 2025-11-06T02:19:39Z

PR Review: Fix 1006 WebSocket Close Issue

Summary

This PR addresses WebSocket close code 1006 (abnormal closure) errors by gracefully handling cases where the client has already closed the connection before the server attempts to send its close frame.

Code Quality & Best Practices ✅

Strengths:

Proper error handling: The change from propagating errors (?) to explicitly handling them with match is the correct approach for this scenario
Good logging: Uses structured logging with ?err and ?close_frame following the project's tracing patterns
Clear comments: Explains why failures are acceptable ("This may fail if client already sent close frame, which is normal")
Follows conventions: Log messages are lowercase as per CLAUDE.md guidelines

Minor suggestions:

The new to_hyper_close behavior (lines 2567-2572) now always sends a close frame with CloseCode::Normal and reason "ws.closed" when frame is None. Previously it sent Message::Close(None). This is a semantic change that should be explicitly mentioned in the PR description.

Potential Bugs or Issues ⚠️

Race condition handling: The fix correctly addresses the race condition where client initiates close handshake before server tries to send its close frame. However, there's a potential edge case: if send() succeeds but flush() fails, the close frame may be partially sent. The current implementation handles this gracefully by logging and continuing.
Error swallowing: While intentional, swallowing errors here means legitimate issues (e.g., network failures, resource exhaustion) are only logged at debug level. Consider distinguishing between "connection already closing" errors vs other error types if possible.

Performance Considerations ✅

No performance regression: The changes add minimal overhead (just pattern matching on error cases)
The WEBSOCKET_CLOSE_LINGER (100ms) delay is still applied regardless of send/flush success, which is correct for graceful closure

Security Concerns 🔒

Low risk, but worth noting:

The change to always provide a close reason ("ws.closed") instead of None exposes slightly more information about the server implementation. This is minimal and likely acceptable.
The close reason is properly bounded to 123 bytes (line 2549), preventing protocol violations.

Test Coverage 🧪

Recommendation: Add a test case that simulates the client-initiated close scenario to verify the fix works as expected and prevent regressions. The existing test files (tests/websocket.rs, tests/simple_websocket.rs) should be checked to ensure this scenario is covered.

Script Changes ⚠️

The scripts/run/engine-rocksdb.sh changes appear unrelated to the WebSocket fix. These Pegboard configuration changes should either be moved to a separate PR or have their purpose explained in the PR description if they're related to testing this fix.

Overall Assessment

Verdict: ✅ Approve with minor suggestions

The core WebSocket fix is solid and addresses a real race condition issue. The error handling approach is appropriate for this use case.

Action items before merge:

⚠️ Clarify or separate the engine-rocksdb.sh changes
📝 Document the semantic change in to_hyper_close (always sending close frame vs None)
🧪 Consider adding a test case for client-initiated close scenarios

Great work on identifying and fixing this race condition! The 1006 errors should be significantly reduced.

claude · 2025-11-06T05:40:11Z

Code Review

Overview

This PR addresses WebSocket close handling by making the close frame sending and flushing operations more resilient to failures when the client has already closed the connection. This is a good defensive programming practice for WebSocket error code 1006 (abnormal closure).

Strengths

Good error handling improvement: Converting from .await? to match statements with explicit error handling prevents panics when the client closes the connection first.
Appropriate logging: The debug-level logging follows the project conventions (lowercase messages, structured logging with ?err).
Good comments: The inline comment explaining why failures are expected is helpful for future maintainers.
Consistent pattern: The same error handling is applied to both send and flush operations.

Issues and Suggestions

1. Script file changes appear unrelated

The changes to scripts/run/engine-rocksdb.sh add Pegboard configuration values but are not mentioned in the PR title/description. These seem unrelated to the WebSocket fix.

Recommendation: Consider splitting these into a separate commit/PR, or add a description explaining why these changes are needed for testing the WebSocket fix.

2. Inconsistent error handling pattern in the file

Looking at lines 1918-1925, there is an error path that still uses .await? for the same operations, and similarly at lines 1962-1969. If the client closing the connection early is a normal scenario, these other code paths should probably use the same defensive error handling.

Recommendation: Apply the same error handling pattern to all WebSocket close operations in this file for consistency.

3. Changed behavior in to_hyper_close function

The None case now returns a close frame with code Normal and reason "ws.closed" instead of Message::Close(None). While this is probably fine, it changes the wire protocol.

Questions: Is there a specific reason for this change? Have you verified this does not affect any clients expecting Close(None)?

4. Potentially excessive logging

The new code adds 4 debug log lines for each close operation. In a high-traffic environment, this might be noisy.

Recommendation: Consider consolidating to 1-2 log lines, or only log failures.

5. Missing tests

There do not appear to be any tests in the guard-core package for WebSocket handling. This specific edge case (client closes first) would be valuable to test.

Recommendation: Consider adding integration tests for WebSocket close scenarios.

Performance and Security

Performance: No concerns. The change only affects the close path which is already low-frequency.
Security: No concerns. Proper close handling actually improves security by preventing potential resource leaks from panicked connections.

Code Quality Score: 7/10

The core fix is solid, but the PR would benefit from:

Splitting unrelated script changes
Applying consistent error handling throughout the file
Reducing log verbosity
Adding tests

Final Recommendation

Approve with minor changes requested. The fix addresses a real issue, but please consider the consistency and scope concerns mentioned above.

graphite-app · 2025-11-06T05:47:22Z

Merge activity

Nov 6, 5:47 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Nov 6, 5:48 AM UTC: CI is running for this pull request on a draft pull request (#3395) due to your merge queue CI optimization settings.
Nov 6, 5:50 AM UTC: Merged by the Graphite merge queue via draft PR: #3395.

NathanFlurry mentioned this pull request Nov 6, 2025

chore(rivetkit): remove warnings for graceful runner shtudown #3379

Closed

vercel bot deployed to Preview – rivetkit-serverless November 6, 2025 01:00 View deployment

NathanFlurry force-pushed the 11-05-chore_runner_add_child_logger_with_runner_id branch from c923a6f to 51912f8 Compare November 6, 2025 01:00

NathanFlurry force-pushed the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch from 500649f to 4078059 Compare November 6, 2025 01:00

This was referenced Nov 6, 2025

fix(rocksdb): fix txn conflict checker #3376

Closed

fix: boot dupe runners #3377

Closed

fix(gateway): fix gc logic #3378

Closed

vercel bot deployed to Preview – rivetkit-serverless November 6, 2025 01:01 View deployment

NathanFlurry force-pushed the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch from 4078059 to 04f4430 Compare November 6, 2025 02:16

NathanFlurry force-pushed the 11-05-chore_runner_add_child_logger_with_runner_id branch from 51912f8 to 4831d26 Compare November 6, 2025 02:16

vercel bot deployed to Preview – rivetkit-serverless November 6, 2025 02:17 View deployment

fix(guard): fix 1006 when client closes ws connection

d93280c

NathanFlurry force-pushed the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch from 04f4430 to d93280c Compare November 6, 2025 05:36

NathanFlurry force-pushed the 11-05-chore_runner_add_child_logger_with_runner_id branch from 4831d26 to ff5ae6a Compare November 6, 2025 05:36

NathanFlurry requested a review from MasterPtato November 6, 2025 05:36

NathanFlurry mentioned this pull request Nov 6, 2025

fix(rivetkit): fix initial hibernation state always being overridden with msgIndex of -1 #3389

Closed

vercel bot deployed to Preview – rivetkit-serverless November 6, 2025 05:37 View deployment

graphite-app bot pushed a commit that referenced this pull request Nov 6, 2025

fix(guard): fix 1006 when client closes ws connection (#3381)

aa4fc4c

graphite-app bot closed this Nov 6, 2025

graphite-app bot deleted the 11-05-fix_guard_fix_1006_when_client_closes_ws_connection branch November 6, 2025 05:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(guard): fix 1006 when client closes ws connection #3381

fix(guard): fix 1006 when client closes ws connection #3381

NathanFlurry commented Nov 6, 2025

Uh oh!

vercel bot commented Nov 6, 2025 •

edited

Loading

Uh oh!

NathanFlurry commented Nov 6, 2025 •

edited

Loading

Uh oh!

pkg-pr-new bot commented Nov 6, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 6, 2025

Uh oh!

claude bot commented Nov 6, 2025

Uh oh!

claude bot commented Nov 6, 2025

Uh oh!

claude bot commented Nov 6, 2025

Uh oh!

claude bot commented Nov 6, 2025

Uh oh!

graphite-app bot commented Nov 6, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(guard): fix 1006 when client closes ws connection #3381

fix(guard): fix 1006 when client closes ws connection #3381

Conversation

NathanFlurry commented Nov 6, 2025

Uh oh!

vercel bot commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NathanFlurry commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

pkg-pr-new bot commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Nov 6, 2025

Code Review

Summary

Positive Aspects

Issues and Concerns

1. Overly Verbose Logging (Minor)

2. Breaking Change in to_hyper_close() (Medium Severity)

3. Unrelated Script Changes (Low Severity)

Missing Elements

1. Test Coverage

2. Documentation

Performance Considerations

Security Considerations

Recommendations Summary

Uh oh!

claude bot commented Nov 6, 2025

Code Review: WebSocket Close Frame Handling Fix

Summary

✅ What's Working Well

🔍 Observations & Minor Concerns

1. Script File Changes

2. Error Context

3. Close Frame Reason

🧪 Testing Considerations

🎯 Security & Performance

📝 Code Style Compliance

Final Verdict

Uh oh!

claude bot commented Nov 6, 2025

Code Review

Summary

Positive Aspects

Issues and Concerns

Recommendations

Conclusion

Uh oh!

claude bot commented Nov 6, 2025

PR Review: Fix 1006 WebSocket Close Issue

Summary

Code Quality & Best Practices ✅

Potential Bugs or Issues ⚠️

Performance Considerations ✅

Security Concerns 🔒

Test Coverage 🧪

Script Changes ⚠️

Overall Assessment

Uh oh!

claude bot commented Nov 6, 2025

Code Review

Overview

Strengths

Issues and Suggestions

1. Script file changes appear unrelated

2. Inconsistent error handling pattern in the file

3. Changed behavior in to_hyper_close function

4. Potentially excessive logging

5. Missing tests

Performance and Security

Code Quality Score: 7/10

Final Recommendation

Uh oh!

graphite-app bot commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

vercel bot commented Nov 6, 2025 •

edited

Loading

NathanFlurry commented Nov 6, 2025 •

edited

Loading

pkg-pr-new bot commented Nov 6, 2025 •

edited

Loading

graphite-app bot commented Nov 6, 2025 •

edited

Loading