Description
Summary
The test_gateway_reconnection test fails when proximity cache broadcasts are enabled in 2-node networks. The PUT operation succeeds (contract is cached), but the response never reaches the client, causing a timeout.
Environment
- Branch: pr-1853-clean-restart (PR feat: proximity-based update forwarding (clean implementation) #1937)
- Commit: Current HEAD (workaround removed per @sanity feedback)
- Affected Test: crates/core/tests/connectivity.rs::test_gateway_reconnection
Symptoms
test test_gateway_reconnection has been running for over 60 seconds
thread 'test_gateway_reconnection' panicked at crates/core/tests/connectivity.rs:304:13:
error: anyhow::Error - Timeout waiting for put response
Timeline:
- Gateway and peer connect successfully
- PUT operation initiated by client
- Both nodes successfully cache the contract (logs show "Added contract to cache")
- Proximity cache broadcast sent to peer
- PUT response never arrives at client
- Test times out after 60s
Investigation Findings
✅ What's Working Correctly
- No message loops: Proximity cache protocol is architecturally sound
  - handle_message() returns None for CacheAnnounce (mod.rs:1057-1078)
  - Broadcasts are one-way, responses are point-to-point only
  - Deduplication works via cache.insert(hash)
- No packet floods: Unlike the original issue in commit 1662369
  - Original: 1300+ dropped packets
  - Current: Only 1 dropped websocket connection (test cleanup)
  - The packet flood issue appears to have been resolved
- PUT operation succeeds:
  - Contract is successfully cached on both nodes
  - No errors in PUT operation logs
  - Proximity cache broadcast is sent correctly
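The deduplication point above depends on insert returning whether the hash was new. A minimal sketch of that pattern follows, assuming the cache behaves like a std HashSet. SeenCache, should_announce, and the u64 hash type are hypothetical names for illustration and are not taken from the codebase.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the set of contract hashes this node has
/// already announced. The real cache type is not shown in this issue.
struct SeenCache {
    hashes: HashSet<u64>,
}

impl SeenCache {
    fn new() -> Self {
        Self { hashes: HashSet::new() }
    }

    /// Returns true only the first time a hash is inserted, so a caller
    /// that announces on `true` sends each announcement at most once.
    fn should_announce(&mut self, hash: u64) -> bool {
        // HashSet::insert returns false when the value was already present.
        self.hashes.insert(hash)
    }
}
```

If the real cache follows the same contract, a repeated announcement for the same contract is dropped at this point, which is consistent with the "no message loops" finding.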
❌ The Problem
PUT response fails to reach client specifically in 2-node networks when proximity broadcasts occur.
Evidence from logs:
2025-10-19T01:32:09.913709Z INFO Created new operation - starting network request, transaction: 01K7X1GW9SSBGBY8R3RJN69M01
2025-10-19T01:32:09.941766Z INFO PROXIMITY_PROPAGATION: Added contract to cache
2025-10-19T01:32:10.075582Z INFO PROXIMITY_PROPAGATION: Added contract to cache
[60 seconds of silence]
error: anyhow::Error - Timeout waiting for put response
Current Workaround
The workaround from commit 1662369 still fixes the issue:
// In crates/core/src/node/network_bridge/p2p_protoc.rs:705
NodeEvent::BroadcastProximityCache { from, message } => {
    if self.connections.len() <= 1 {
        tracing::debug!("Skipping broadcast in 2-node network");
        continue;
    }
    // ... broadcast logic
}
Test Results:
- ✅ With workaround: Test passes in 27.22s
- ❌ Without workaround: Test times out after 60s
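The guard in the workaround can be pulled out into a pure function so the boundary can be unit-tested without starting a network. This is only a sketch: should_broadcast_proximity_cache is a hypothetical helper, and it assumes the connection count is the only input to the decision.

```rust
/// Hypothetical helper, not part of the codebase: decides whether a
/// proximity cache broadcast should be sent, given the number of live
/// connections held by this node.
fn should_broadcast_proximity_cache(connection_count: usize) -> bool {
    // Mirrors the workaround's `connections.len() <= 1` check: with zero or
    // one connection the node is alone or in a 2-node network, so skip.
    connection_count > 1
}
```

Testing the boundary in isolation documents exactly which topologies the workaround affects, independent of the response delivery problem itself.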
Code References
Proximity cache broadcast sender (p2p_protoc.rs:705-740):
- Event handler spawns tasks to send broadcasts to all peers
- Each broadcast is a NetMessage::V1(NetMessageV1::ProximityCache)
Proximity cache message receiver (mod.rs:1057-1078):
- Handles incoming ProximityCache messages
- Updates neighbor cache knowledge
- Optionally sends point-to-point responses (for state requests only)
PUT response flow (mod.rs:907-952):
- PUT operation completes successfully
- Response should be sent back to client
- Something interferes with delivery in 2-node networks
Investigation Approach Taken
- ✅ Traced complete message flow (PUT → cache → broadcast → receive)
- ✅ Verified no message loops exist in protocol
- ✅ Confirmed PUT operation completes successfully
- ✅ Checked for packet floods (none found)
- ✅ Verified workaround still fixes the issue
- ❌ Not yet identified: Why PUT responses don't reach clients
Questions for Further Investigation
- Response routing: Does the proximity broadcast somehow interfere with the PUT response channel?
- Timing issue: Is there a race condition between broadcast completion and response delivery?
- 2-node specific: Why does this only affect 2-node networks? (3+ node tests pass)
- Client session state: Does the broadcast affect client transaction tracking?
Proposed Next Steps
Option A (Short-term): Restore workaround with improved documentation
- Re-add connections.len() <= 1 check
- Update comment to reflect new understanding (not a packet flood, but a response delivery issue)
- Document as known limitation requiring investigation
Option B (Investigation): Debug PUT response delivery
- Add detailed tracing to response routing code
- Check if result_router_tx channel has issues in 2-node scenarios
- Examine MessageProcessor behavior during broadcasts
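One way to narrow down Option B is to distinguish "the response was never sent" from "the sending side was dropped" when waiting on the result channel. The sketch below uses std::sync::mpsc purely for illustration. The actual result_router_tx channel type is not shown in this issue and may be an async channel, and ResponseOutcome and wait_for_response are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical classification of what happened while waiting for a response.
#[derive(Debug, PartialEq)]
enum ResponseOutcome<T> {
    /// A response arrived within the timeout.
    Delivered(T),
    /// The sender is still alive but sent nothing before the timeout.
    NeverSent,
    /// Every sender was dropped, so no response can ever arrive.
    SenderDropped,
}

/// Waits up to `timeout` for one message and reports which case occurred.
fn wait_for_response<T>(rx: &Receiver<T>, timeout: Duration) -> ResponseOutcome<T> {
    match rx.recv_timeout(timeout) {
        Ok(value) => ResponseOutcome::Delivered(value),
        Err(RecvTimeoutError::Timeout) => ResponseOutcome::NeverSent,
        Err(RecvTimeoutError::Disconnected) => ResponseOutcome::SenderDropped,
    }
}
```

Logging the outcome at the point where the test currently reports "Timeout waiting for put response" would show whether the response path is stalled or has been torn down, which helps separate the routing and race condition questions above.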
Option C (Test modification): Disable proximity cache for 2-node tests
- Production networks will have >2 nodes
- Keep proximity cache enabled for 3+ node tests
- Add separate test specifically for 2-node proximity cache behavior
Related Context
- Original workaround: commit 1662369 (Oct 7, 2025)
- Stack overflow fix: commit 1662369 (spawning broadcast tasks)
- PR review feedback: @sanity requested workaround removal
- This investigation: Attempting to address root cause per user request
[AI-assisted debugging and comment]