PUT response delivery fails in 2-node networks with proximity cache broadcasts #1960

@sanity

Summary

The test_gateway_reconnection test fails when proximity cache broadcasts are enabled in 2-node networks. The PUT operation succeeds (contract is cached), but the response never reaches the client, causing a timeout.

Environment

Symptoms

test test_gateway_reconnection has been running for over 60 seconds
thread 'test_gateway_reconnection' panicked at crates/core/tests/connectivity.rs:304:13:
error: anyhow::Error - Timeout waiting for put response

Timeline:

  1. Gateway and peer connect successfully
  2. PUT operation initiated by client
  3. Both nodes successfully cache the contract (logs show "Added contract to cache")
  4. Proximity cache broadcast sent to peer
  5. PUT response never arrives at client
  6. Test times out after 60s

Investigation Findings

✅ What's Working Correctly

  1. No message loops: Proximity cache protocol is architecturally sound

    • handle_message() returns None for CacheAnnounce (mod.rs:1057-1078)
    • Broadcasts are one-way, responses are point-to-point only
    • Deduplication works via cache.insert(hash)
  2. No packet floods: Unlike the original issue in commit 1662369

    • Original: 1300+ dropped packets
    • Current: Only 1 dropped websocket connection (test cleanup)
    • The packet flood issue appears to have been resolved
  3. PUT operation succeeds:

    • Contract is successfully cached on both nodes
    • No errors in PUT operation logs
    • Proximity cache broadcast is sent correctly
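The one-way announce and deduplication behavior described in the first finding can be illustrated with a minimal sketch. This is not the code at mod.rs:1057-1078: the `ProximityCache` struct, the `StateRequest` and `StateResponse` variants, the `on_contract_cached` helper, and the use of `u64` hashes are all assumptions made for illustration. Only the `CacheAnnounce` name, the `None` return for it, and the `insert`-based deduplication come from the findings above.

```rust
use std::collections::HashSet;

/// Hypothetical message type. Only `CacheAnnounce` is named in this issue;
/// the other variants and all payload shapes are assumed.
#[derive(Debug, Clone, PartialEq)]
enum ProximityMessage {
    CacheAnnounce { contract_hashes: Vec<u64> },
    StateRequest,
    StateResponse { contract_hashes: Vec<u64> },
}

/// Hypothetical cache state: what this node holds and what it believes
/// its neighbor holds.
#[derive(Default)]
struct ProximityCache {
    local: HashSet<u64>,
    neighbor_known: HashSet<u64>,
}

impl ProximityCache {
    /// `HashSet::insert` returns true only the first time a hash is added,
    /// so a repeated PUT of the same contract would not trigger a second
    /// announcement.
    fn on_contract_cached(&mut self, hash: u64) -> bool {
        self.local.insert(hash)
    }

    /// Announcements are absorbed and produce no reply (no loop is possible).
    /// Only a state request yields a point-to-point response.
    fn handle_message(&mut self, msg: ProximityMessage) -> Option<ProximityMessage> {
        match msg {
            ProximityMessage::CacheAnnounce { contract_hashes } => {
                self.neighbor_known.extend(contract_hashes);
                None
            }
            ProximityMessage::StateRequest => {
                let mut hashes: Vec<u64> = self.local.iter().copied().collect();
                hashes.sort_unstable();
                Some(ProximityMessage::StateResponse { contract_hashes: hashes })
            }
            ProximityMessage::StateResponse { contract_hashes } => {
                self.neighbor_known.extend(contract_hashes);
                None
            }
        }
    }
}
```

Under these assumptions, two nodes exchanging `CacheAnnounce` messages can never ping-pong, which is consistent with the finding that the failure is not a message loop.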

❌ The Problem

The PUT response fails to reach the client, specifically in 2-node networks and only when proximity cache broadcasts occur.

Evidence from logs:

2025-10-19T01:32:09.913709Z INFO Created new operation - starting network request, transaction: 01K7X1GW9SSBGBY8R3RJN69M01
2025-10-19T01:32:09.941766Z INFO PROXIMITY_PROPAGATION: Added contract to cache
2025-10-19T01:32:10.075582Z INFO PROXIMITY_PROPAGATION: Added contract to cache
[60 seconds of silence]
error: anyhow::Error - Timeout waiting for put response

Current Workaround

The workaround from commit 1662369 still fixes the issue:

// In crates/core/src/node/network_bridge/p2p_protoc.rs:705
NodeEvent::BroadcastProximityCache { from, message } => {
    if self.connections.len() <= 1 {
        tracing::debug!("Skipping broadcast in 2-node network");
        continue;
    }
    // ... broadcast logic
}

Test Results:

  • ✅ With workaround: Test passes in 27.22s
  • ❌ Without workaround: Test times out after 60s
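If the workaround is kept for now, the guard could be pulled out into a small pure function so that its behavior can be unit-tested without a running network. The sketch below is an assumption about how that might look. `BroadcastDecision` and `decide_broadcast` are hypothetical names and do not exist in p2p_protoc.rs. The only logic taken from the workaround is the `connections.len() <= 1` condition.

```rust
/// Hypothetical result type for the guard; not part of the codebase.
#[derive(Debug, PartialEq)]
enum BroadcastDecision {
    /// One or zero connections: this node is in a network of at most two
    /// nodes, so the broadcast is skipped (the workaround's behavior).
    Skip,
    /// More than one connection: broadcast to this many peers.
    SendTo(usize),
}

/// Mirrors the `connections.len() <= 1` check from the workaround.
fn decide_broadcast(connection_count: usize) -> BroadcastDecision {
    if connection_count <= 1 {
        BroadcastDecision::Skip
    } else {
        BroadcastDecision::SendTo(connection_count)
    }
}
```

In the event handler, a `Skip` result would correspond to the existing `tracing::debug!` call followed by `continue`.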

Code References

Proximity cache broadcast sender (p2p_protoc.rs:705-740):

  • Event handler spawns tasks to send broadcasts to all peers
  • Each broadcast is a NetMessage::V1(NetMessageV1::ProximityCache)

Proximity cache message receiver (mod.rs:1057-1078):

  • Handles incoming ProximityCache messages
  • Updates neighbor cache knowledge
  • Optionally sends point-to-point responses (for state requests only)

PUT response flow (mod.rs:907-952):

  • PUT operation completes successfully
  • Response should be sent back to client
  • Something interferes with delivery in 2-node networks

Investigation Approach Taken

  1. ✅ Traced complete message flow (PUT → cache → broadcast → receive)
  2. ✅ Verified no message loops exist in protocol
  3. ✅ Confirmed PUT operation completes successfully
  4. ✅ Checked for packet floods (none found)
  5. ✅ Verified workaround still fixes the issue
  6. Not yet identified: Why PUT responses don't reach clients

Questions for Further Investigation

  1. Response routing: Does the proximity broadcast somehow interfere with the PUT response channel?
  2. Timing issue: Is there a race condition between broadcast completion and response delivery?
  3. 2-node specific: Why does this only affect 2-node networks? (3+ node tests pass)
  4. Client session state: Does the broadcast affect client transaction tracking?

Proposed Next Steps

Option A (Short-term): Restore workaround with improved documentation

  • Re-add connections.len() <= 1 check
  • Update comment to reflect new understanding (not a packet flood, but a response delivery issue)
  • Document as known limitation requiring investigation

Option B (Investigation): Debug PUT response delivery

  • Add detailed tracing to response routing code
  • Check if result_router_tx channel has issues in 2-node scenarios
  • Examine MessageProcessor behavior during broadcasts
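One concrete thing the added tracing could establish is whether the response was never sent or whether the sending side of the channel was dropped before sending. The sketch below shows that distinction using `std::sync::mpsc` to stay dependency-free. The real `result_router_tx` is likely an async channel whose type is not shown in this issue, so `WaitOutcome`, `wait_for_response`, and the `String` payload are illustrative assumptions only.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical classification of what the waiting side observed.
#[derive(Debug, PartialEq)]
enum WaitOutcome {
    /// A response arrived within the timeout.
    Delivered(String),
    /// The sender is still alive but nothing was sent in time.
    NeverSent,
    /// Every sender was dropped without sending a response.
    SenderDropped,
}

/// Waits for one response and reports which failure mode, if any, occurred.
fn wait_for_response(rx: &mpsc::Receiver<String>, timeout: Duration) -> WaitOutcome {
    match rx.recv_timeout(timeout) {
        Ok(resp) => WaitOutcome::Delivered(resp),
        Err(RecvTimeoutError::Timeout) => WaitOutcome::NeverSent,
        Err(RecvTimeoutError::Disconnected) => WaitOutcome::SenderDropped,
    }
}
```

Logging the equivalent of `NeverSent` versus `SenderDropped` at the point where the client waits would narrow the search: the first points at routing or an operation that never reaches its final state, the second at a channel being torn down early.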

Option C (Test modification): Disable proximity cache for 2-node tests

  • Production networks will have >2 nodes
  • Keep proximity cache enabled for 3+ node tests
  • Add separate test specifically for 2-node proximity cache behavior

Related Context

  • Original workaround: commit 1662369 (Oct 7, 2025)
  • Stack overflow fix: commit 1662369 (spawning broadcast tasks)
  • PR review feedback: @sanity requested workaround removal
  • This investigation: Attempting to address root cause per user request

[AI-assisted debugging and comment]
