Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 4, 2025 7:00pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 4, 2025 7:00pm
rivet-inspector Ignored Ignored Preview Nov 4, 2025 7:00pm
rivet-site Ignored Ignored Preview Nov 4, 2025 7:00pm

Copy link
Member Author

NathanFlurry commented Nov 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 3, 2025

PR Review: Fix tunnel ack from gateway

Summary

This PR fixes a critical bug in the tunnel acknowledgment mechanism where the gateway was generating a new UUID for the message_id field instead of echoing back the original message_id from the incoming message.

Analysis

The Bug:
The gateway was creating acknowledgments with message_id: Uuid::new_v4().into_bytes() instead of using message_id: msg.message_id. This broke the ack matching logic on the runner side.

Why This Matters:
Looking at the runner-side code in tunnel.ts:247-253, when a TunnelAck is received, it uses the message_id to find and remove the corresponding entry from #pendingTunnelMessages:

if (message.messageKind.tag === "TunnelAck") {
    const msgIdStr = bufferToString(message.messageId);
    const pending = this.#pendingTunnelMessages.get(msgIdStr);
    if (pending) {
        this.#pendingTunnelMessages.delete(msgIdStr);
    }
}

With the wrong message_id, the runner would never find the matching pending message, causing:

  1. Messages to remain in #pendingTunnelMessages indefinitely
  2. Eventually triggering the garbage collector timeout (5 seconds, see MESSAGE_ACK_TIMEOUT)
  3. Requests failing with "Message acknowledgment timeout" errors
  4. WebSocket connections being forcibly closed

The Fix:
The change correctly echoes back the original message_id from the incoming message, allowing the runner to properly match and clear acknowledged messages.

Code Quality Assessment

Positives:

  • ✅ The fix is minimal, surgical, and correct
  • ✅ Follows the protocol specification (see v1.bare:297-308) where ack messages should reference the original message
  • ✅ Consistent with how acks work in bidirectional communication protocols
  • ✅ The surrounding logging already includes the message_id for debugging (shared_state.rs:210)

Concerns:

  1. Test Coverage ⚠️

    • No tests exist for the pegboard-gateway package
    • This critical path (tunnel message acking) has no automated verification
    • The bug suggests this code path may not be well-tested in integration tests either
    • Recommendation: Consider adding integration tests that verify:
      • Messages are properly acknowledged
      • The #pendingTunnelMessages map is cleared after acks
      • Timeout behavior works correctly when acks are missing
  2. Observability ℹ️

    • The existing warning at shared_state.rs:232-234 would have fired when the runner sent acks with the correct message_id but they didn't match anything on the gateway side
    • Consider if additional metrics/tracing would help catch similar issues earlier
    • The log message "pending message does not exist or ack received after message body" could be more specific about whether it is the former or latter case
  3. Documentation ℹ️

    • No inline comments explain the ack matching protocol
    • Future developers might make the same mistake without understanding why the message_id must be echoed
    • Recommendation: Consider adding a comment like:
      // Echo back the message_id so the runner can match and clear
      // the pending message from its tracking map
      message_id: msg.message_id,

Performance Considerations

  • ✅ No performance impact - actually improves performance by preventing unnecessary timeouts and retries
  • ✅ Reduces memory pressure by allowing pending messages to be properly cleared

Security Considerations

  • ✅ No security implications
  • ✅ Fix actually improves reliability which indirectly benefits availability

Best Practices

  • ✅ Follows Rust idioms
  • ✅ Consistent with codebase patterns
  • ✅ Minimal change reduces risk

Verdict

LGTM with recommendations

The fix is correct and necessary. The change properly implements the tunnel acknowledgment protocol. However, I strongly recommend:

  1. Adding test coverage for this critical communication path
  2. Consider adding inline documentation about the ack protocol
  3. Review monitoring/alerting around tunnel message timeouts

This is a high-priority fix that should be merged quickly, but follow-up work on testing would be valuable.


Review generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_pegboard_fix_tunnel_ack_from_gateway branch from 2769824 to 43b9be8 Compare November 4, 2025 00:36
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_rivetkit_explicitly_handle_errors_in__onstop branch from 0bb481c to 158182f Compare November 4, 2025 00:36
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Fix tunnel ack from gateway

Summary

This PR fixes a critical bug in the pegboard gateway's tunnel acknowledgment mechanism. The change is small but important.

The Bug

Before: When the gateway sent a TunnelAck message back to the runner, it was generating a new random UUID (Uuid::new_v4()) for the message_id field.

After: The gateway now correctly echoes back the original message_id from the incoming message.

Why This Matters

Looking at the TypeScript runner code (engine/sdks/typescript/runner/src/tunnel.ts:247-253), when a TunnelAck is received:

if (message.messageKind.tag === "TunnelAck") {
    // Mark pending message as acknowledged and remove it
    const msgIdStr = bufferToString(message.messageId);
    const pending = this.#pendingTunnelMessages.get(msgIdStr);
    if (pending) {
        this.#pendingTunnelMessages.delete(msgIdStr);
    }
}

The runner uses the message.messageId to look up and remove the pending message from its tracking map. By sending back a random UUID instead of the original message ID, the ack would never match the pending message, causing:

  1. Memory leak: Pending messages would never be removed from #pendingTunnelMessages
  2. Incorrect state: The runner would think messages were never acknowledged
  3. Potential retry logic issues: If there's retry logic based on unacknowledged messages

Code Quality

Correct fix: The change properly echoes back the message ID from the incoming message
Minimal change: Only touches what's necessary
Follows protocol: Matches the expected acknowledgment pattern (similar to how TCP ACKs work)

Potential Concerns

⚠️ Test Coverage: This bug suggests there might be missing integration tests for the ack mechanism. Consider adding tests that verify:

  • Acks contain the correct message_id
  • Pending messages are properly removed upon ack receipt
  • The full round-trip of message → ack → cleanup

Recommendation

LGTM

This is a critical bug fix that corrects the tunnel acknowledgment protocol. The fix is correct and should prevent memory leaks and state tracking issues in the runner.

Consider adding integration tests to prevent regression of this issue in the future.

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR fixes a critical bug in the tunnel acknowledgment mechanism where the gateway was generating a new UUID for the ack message instead of echoing back the original message_id.

Analysis

The Bug:

  • Location: engine/packages/pegboard-gateway/src/shared_state.rs:253
  • Issue: When the gateway receives a message from the runner and sends back an acknowledgment, it was incorrectly using Uuid::new_v4().into_bytes() to generate a new message_id instead of using msg.message_id
  • Impact: The runner would never be able to match the acknowledgment to the original message it sent, causing the acknowledgment tracking system to fail

The Fix:

  • Changed line 253 from message_id: Uuid::new_v4().into_bytes(), to message_id: msg.message_id,
  • This ensures the ack contains the same message_id as the incoming message, allowing proper message tracking

Why This Matters:
Looking at the code flow:

  1. When the gateway sends a message to the runner (send_message at line 114-195), it generates a message_id and stores it in pending_msgs (line 159-162)
  2. When it receives a TunnelAck back from the runner (line 224-235), it removes the pending message by matching the message_id (line 229)
  3. Similarly, when the gateway receives a message from the runner and sends an ack back, the runner needs to match that ack's message_id to its pending messages

The bug broke this bidirectional acknowledgment protocol by returning the wrong message_id.

Code Quality

Excellent - This is a minimal, surgical fix that addresses exactly the issue at hand

Potential Issues

⚠️ Test Coverage: The codebase appears to lack tests for this acknowledgment flow. Consider adding integration tests that verify:

  • Message IDs are correctly echoed in acknowledgments
  • Pending messages are properly tracked and removed
  • The warning at line 232-234 is triggered when an ack is received for a non-existent message

Performance Considerations

No impact - The fix actually has a tiny performance improvement (one fewer UUID generation)

Security Concerns

None - The fix maintains the same security posture

Recommendations

  1. Approve this fix - It correctly resolves the acknowledgment matching issue
  2. 📝 Add tests - Consider adding tests for the tunnel message acknowledgment flow to prevent regression
  3. 🔍 Audit similar patterns - Verify there are no other places where message IDs might be incorrectly regenerated instead of echoed

Convention Compliance

✅ Code follows the repository's style guidelines per CLAUDE.md

Verdict: LGTM - This is a well-identified bug fix that restores correct protocol behavior. Recommend merging.

@claude
Copy link

claude bot commented Nov 4, 2025

Pull Request Review

Summary

This PR fixes a critical bug in the tunnel acknowledgment flow from the gateway. The fix changes the message_id in the acknowledgment message from a newly generated UUID to the actual message_id from the incoming message.

Analysis

The Bug:
In shared_state.rs:253, when the gateway sends a TunnelAck back to the runner, it was incorrectly generating a new UUID with Uuid::new_v4().into_bytes() instead of echoing back the original message_id from the incoming message.

Why This is a Problem:
Looking at the acknowledgment handling code in shared_state.rs:225-234, the gateway maintains a list of pending messages (pending_msgs) for each in-flight request. When a TunnelAck is received from the runner, it uses the message_id to remove the corresponding pending message:

in_flight.pending_msgs.retain(|m| m.message_id \!= msg.message_id);

Because the gateway was sending back a different message_id than what it originally sent, the runner's ack would never match any pending message. This would cause:

  1. The warning "pending message does not exist or ack received after message body" to be logged (line 232-234)
  2. Pending messages to accumulate and never be cleared
  3. Potential timeouts and reliability issues in the tunnel message flow

The Fix:
The fix correctly changes line 253 to use msg.message_id instead of generating a new UUID. This ensures the acknowledgment contains the same message_id as the original message, allowing the ack tracking system to work properly.

Code Quality Assessment

Strengths:

  • The fix is minimal and surgical - exactly one line changed
  • Follows the correct pattern used elsewhere in the protocol (see lines 105-112 in versioned.rs where message_id is preserved during protocol version conversions)
  • Directly addresses the root cause

Considerations:

  • No test coverage appears to be added for this fix
  • This bug would have caused reliability issues in production - consider adding integration tests that verify:
    • Pending messages are properly cleared after acks
    • No warnings about missing pending messages
    • Message acknowledgment flow works end-to-end

Security & Performance

Security: No security concerns. This is a bug fix that improves reliability.

Performance: This fix actually improves performance by:

  • Preventing accumulation of pending messages that never get cleared
  • Reducing unnecessary warning log spam
  • Ensuring the timeout mechanism works correctly (see shared_state.rs:420-422)

Recommendation

LGTM with suggestions

The fix is correct and addresses a real bug in the acknowledgment flow. However, I recommend:

  1. Add test coverage - Consider adding a test that verifies the ack flow works correctly and pending messages are cleared
  2. Consider adding a comment - A brief comment explaining that the message_id must be echoed back for ack tracking would help future maintainers

Great catch on this bug!

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_rivetkit_explicitly_handle_errors_in__onstop branch from 0f23c20 to becb4b8 Compare November 4, 2025 19:24
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_pegboard_fix_tunnel_ack_from_gateway branch from d922f4c to 202987a Compare November 4, 2025 19:24
@vercel
Copy link

vercel bot commented Nov 4, 2025

Deployment failed with the following error:

Too many requests - try again in 4 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR fixes a critical bug in the tunnel acknowledgment mechanism between the pegboard gateway and runners. The fix ensures that acknowledgment messages correctly reference the original message ID.

Issue Identified

Before: The gateway was generating a new UUID for the message_id field when sending acknowledgments:

message_id: Uuid::new_v4().into_bytes(),

After: The gateway now correctly echoes back the original message ID from the incoming message:

message_id: msg.message_id,

Impact Analysis

Why This Bug Matters

Looking at the runner-side code (engine/sdks/typescript/runner/src/tunnel.ts:247-253), when a TunnelAck is received, it uses the message_id to find and remove the pending message from tracking:

if (message.messageKind.tag === "TunnelAck") {
    const msgIdStr = bufferToString(message.messageId);
    const pending = this.#pendingTunnelMessages.get(msgIdStr);
    if (pending) {
        this.#pendingTunnelMessages.delete(msgIdStr);
    }
}

With the bug, the randomly generated message_id in the ack would never match any pending message, meaning:

  1. Acks would never clear pending messages from the runner's tracking map
  2. The garbage collector would eventually timeout these "unacked" messages after 5 seconds (MESSAGE_ACK_TIMEOUT)
  3. HTTP requests would be rejected with "Tunnel message not acknowledged in time"
  4. WebSocket connections would be forcibly closed

This would cause all tunnel messages (HTTP requests and WebSocket operations) to fail after 5 seconds, even though they were actually acknowledged by the gateway.

Code Quality Assessment

Positive:

  • ✅ Simple, surgical fix that addresses the root cause
  • ✅ Follows the established protocol pattern (acknowledging with the same message_id)
  • ✅ Consistent with how acks work in bidirectional protocols
  • ✅ No breaking changes to the protocol schema

Suggestions:

  1. Test Coverage: This bug suggests the tunnel ack mechanism may not have integration test coverage. Consider adding tests that verify:

    • Acks correctly clear pending messages
    • The message_id in acks matches the original message
    • Messages don't timeout when properly acked
  2. Protocol Validation: Consider adding debug assertions or validation to catch this class of bug:

    // In development, validate acks echo the message_id
    debug_assert_eq!(ack_message.message_id, original_message.message_id, 
        "Ack message_id must match original message_id");
  3. Documentation: The protocol schema (v2.bare:307-318) documents the ToClientTunnelMessage structure but doesn't specify that acks must echo the message_id. Adding this requirement to documentation could prevent similar bugs.

Security Considerations

No security concerns. This fix actually improves reliability by ensuring the acknowledgment protocol works correctly.

Performance Considerations

Positive performance impact:

  • Eliminates unnecessary 5-second timeouts for all tunnel messages
  • Reduces memory pressure from pending message accumulation
  • Prevents spurious request failures and retries

Conclusion

LGTM ✅ This is an important bug fix that restores proper functionality of the tunnel acknowledgment system. The change is minimal, correct, and should be merged promptly.

Recommendation: After merging, consider adding integration tests to prevent regression of this critical path.

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
  • Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
  • Nov 4, 8:52 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:53 PM UTC: CI is running for this pull request on a draft pull request (#3353) due to your merge queue CI optimization settings.
  • Nov 4, 8:56 PM UTC: Merged by the Graphite merge queue via draft PR: #3353.

graphite-app bot pushed a commit that referenced this pull request Nov 4, 2025
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_rivetkit_explicitly_handle_errors_in__onstop branch from becb4b8 to 96c8a07 Compare November 4, 2025 20:25
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_pegboard_fix_tunnel_ack_from_gateway branch from 202987a to 0e9f26e Compare November 4, 2025 20:25
@vercel
Copy link

vercel bot commented Nov 4, 2025

Deployment failed with the following error:

Too many requests - try again in 5 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR fixes a critical bug in the tunnel acknowledgment mechanism within the pegboard gateway. The change correctly uses the original message ID from the incoming message when sending acknowledgments, rather than generating a new random UUID.

Analysis

The Bug:
The previous code (line 253 in shared_state.rs) was generating a new UUID for the message_id field when creating a TunnelAck response:

message_id: Uuid::new_v4().into_bytes(),  // ❌ Wrong - generates new ID

The Fix:
The corrected code now uses the original message ID from the incoming message:

message_id: msg.message_id,  // ✅ Correct - uses original ID

Why This Matters

Looking at the TypeScript runner implementation (tunnel.ts:247-253), when a TunnelAck is received, it attempts to remove the pending message from #pendingTunnelMessages using the message ID:

if (message.messageKind.tag === "TunnelAck") {
    const msgIdStr = bufferToString(message.messageId);
    const pending = this.#pendingTunnelMessages.get(msgIdStr);
    if (pending) {
        this.#pendingTunnelMessages.delete(msgIdStr);
    }
}

With the bug, the gateway was sending back a different message ID in the ack than what was originally sent. This means:

  1. The runner would never find a matching pending message
  2. Pending messages would never be removed from the map
  3. Eventually, the garbage collector would time them out after 5 seconds (MESSAGE_ACK_TIMEOUT)
  4. This could lead to unnecessary request retries and failures

Correctness Verification

The fix aligns with the protocol pattern seen throughout the codebase:

  • When sending messages to the runner, the gateway generates new message IDs (line 119: Uuid::new_v4().as_bytes())
  • When acknowledging messages from the runner, the gateway must echo back the same message ID

This is confirmed by examining lines 224-236 where incoming TunnelAck messages are processed:

if let protocol::ToServerTunnelMessageKind::TunnelAck = &msg.message_kind {
    in_flight.pending_msgs.retain(|m| m.message_id != msg.message_id);
}

The gateway itself expects to receive back the original message ID when it sends messages to the runner.

Impact Assessment

Severity: High
Type: Correctness bug

This bug would cause:

  • ✅ Message delivery still works (messages were sent)
  • ❌ Acknowledgment tracking completely broken
  • ❌ Memory leak in pending message maps
  • ❌ False timeout errors after 5 seconds
  • ❌ Potential cascading failures in long-lived connections

Testing Recommendations

While I don't see existing tests for this module, this change would benefit from:

  1. Integration test verifying ack message IDs match original message IDs
  2. Test ensuring pending messages are properly cleaned up after acks
  3. Test verifying no timeout errors occur when acks are properly received

Code Quality

Follows conventions: Uses existing message fields correctly
Minimal change: One-line fix with clear intent
No side effects: Pure correctness fix
Well-scoped: Addresses exactly the issue in the title

Verdict

LGTM - This is an excellent bug fix that corrects a fundamental protocol violation. The change is minimal, correct, and should be merged promptly to prevent acknowledgment tracking failures in production.


Review generated with assistance from Claude Code

graphite-app bot pushed a commit that referenced this pull request Nov 4, 2025
@graphite-app graphite-app bot closed this Nov 4, 2025
@graphite-app graphite-app bot deleted the 11-03-fix_pegboard_fix_tunnel_ack_from_gateway branch November 4, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants