|
| 1 | +# Investigation Report: Issue #2023 - test_small_network_get_failure |
| 2 | + |
| 3 | +## Issue Overview |
| 4 | +Test `test_small_network_get_failure` is currently ignored due to: |
| 5 | +1. PUT operations timing out (exceeding 30-second threshold) |
| 6 | +2. Gateway process crashes during or after PUT operations |
| 7 | + |
| 8 | +**Test Location**: `apps/freenet-ping/app/tests/test_small_network_get_issue.rs:24` |
| 9 | + |
| 10 | +## Investigation Findings |
| 11 | + |
| 12 | +### 1. Test Configuration Analysis |
| 13 | +- **Network Topology**: Star topology with 1 gateway + 3 nodes |
| 14 | + - All nodes connect only through the gateway (poor connectivity by design) |
| 15 | + - Simulates production conditions with limited peer connections |
| 16 | +- **Timeouts**: |
| 17 | + - Test client timeout: **30 seconds** (line 221) |
| 18 | + - Internal operation TTL: **60 seconds** (`OPERATION_TTL` in `config/mod.rs:40`) |
| 19 | + - Cleanup interval: **5 seconds** |
| 20 | +- **Problem**: Test times out at 30s before internal operation timeout at 60s |
| 21 | + |
| 22 | +### 2. Gateway Crash Root Cause (FIXED ✅) |
| 23 | + |
| 24 | +**Commit a283e23**: "fix: guard op-state timeout notifications" |
| 25 | + |
| 26 | +**Problem**: |
| 27 | +```rust |
| 28 | +// OLD CODE - Would panic if receiver dropped |
| 29 | +event_loop_notifier.notifications_sender |
| 30 | + .send(Either::Right(NodeEvent::TransactionTimedOut(tx))) |
| 31 | + .await |
| 32 | + .unwrap(); // ❌ PANIC if send fails |
| 33 | +``` |
| 34 | + |
| 35 | +**Solution**: |
| 36 | +```rust |
| 37 | +// NEW CODE - Handles errors gracefully |
| 38 | +async fn notify_transaction_timeout( |
| 39 | + event_loop_notifier: &EventLoopNotificationsSender, |
| 40 | + tx: Transaction, |
| 41 | +) -> bool { |
| 42 | + match event_loop_notifier.notifications_sender.send(...).await { |
| 43 | + Ok(()) => true, |
| 44 | + Err(err) => { |
| 45 | + tracing::warn!("Failed to notify; receiver likely dropped"); |
| 46 | + false // ✅ No panic |
| 47 | + } |
| 48 | + } |
| 49 | +} |
| 50 | +``` |
| 51 | + |
| 52 | +**Impact**: This was the root cause of gateway crashes during timeout scenarios. The gateway would panic when trying to send timeout notifications after the receiver channel was dropped. |
| 53 | + |
| 54 | +### 3. PUT Operation Issues (PARTIALLY FIXED ⚠️) |
| 55 | + |
| 56 | +#### Recent Fixes: |
| 57 | +1. **Commit 615f02d**: "fix: route successful PUT responses back through forwarding peers" |
| 58 | + - Fixed routing of SuccessfulPut messages |
| 59 | + - Ensures responses reach the originating node |
| 60 | + |
| 61 | +2. **Commit 5734a33**: "fix: cache contract state locally before forwarding client-initiated PUT" |
| 62 | + - Ensures publishing node caches state immediately |
| 63 | + - Improves reliability of PUT operations |
| 64 | + |
| 65 | +3. **Commit a34470b**: "feat: implement transaction atomicity with parent-child relationship" |
| 66 | + - Added proper parent-child transaction tracking |
| 67 | + - Improves composite operation reliability |
| 68 | + |
| 69 | +#### Remaining Concerns: |
| 70 | +The PUT operation code (`crates/core/src/operations/put.rs`) contains multiple `unwrap()` and `expect()` calls: |
| 71 | + |
| 72 | +```rust |
| 73 | +// Lines with potential panics: |
| 74 | +361: peer = %op_manager.ring.connection_manager.get_peer_key().unwrap() |
| 75 | +553: let peer = broadcast_to.get(peer_num).unwrap() |
| 76 | +596: peer = %op_manager.ring.connection_manager.get_peer_key().unwrap() |
| 77 | +...and more |
| 78 | +``` |
| 79 | + |
| 80 | +Most are for: |
| 81 | +- `get_peer_key().unwrap()` - Should be safe if peer is initialized |
| 82 | +- `location.expect("infallible")` - Documented as infallible |
| 83 | +- `target.unwrap()` - After checking `is_some()` |
| 84 | + |
| 85 | +### 4. Timeout Configuration Mismatch |
| 86 | + |
| 87 | +**Problem**: Test timeout (30s) < Operation TTL (60s) |
| 88 | + |
| 89 | +```rust |
| 90 | +// In test (line 221) |
| 91 | +match timeout(Duration::from_secs(30), client_node1.recv()).await { |
| 92 | + ... |
| 93 | + Err(_) => { |
| 94 | + println!("Timeout waiting for put response"); // ⏰ 30 seconds |
| 95 | + } |
| 96 | +} |
| 97 | + |
| 98 | +// In config (crates/core/src/config/mod.rs:40) |
| 99 | +pub(crate) const OPERATION_TTL: Duration = Duration::from_secs(60); // ⏰ 60 seconds |
| 100 | +``` |
| 101 | + |
| 102 | +**Impact**: Test fails before internal cleanup can occur, making it hard to distinguish between: |
| 103 | +- Actual operation failures |
| 104 | +- Slow network propagation |
| 105 | +- Connection establishment delays |
| 106 | + |
| 107 | +### 5. Test Quality Issues |
| 108 | + |
| 109 | +The test uses deprecated patterns: |
| 110 | +```rust |
| 111 | +// ❌ Deprecated logging |
| 112 | +freenet::config::set_logger(Some(LevelFilter::DEBUG), None); |
| 113 | + |
| 114 | +// ❌ Manual node setup (50+ lines of boilerplate) |
| 115 | +// ❌ No event aggregation for debugging |
| 116 | +// ❌ No peer identification in logs |
| 117 | +``` |
| 118 | + |
| 119 | +**Should use**: |
| 120 | +```rust |
| 121 | +#[freenet_test( |
| 122 | + nodes = ["gateway", "node1", "node2"], |
| 123 | + auto_connect_peers = true, |
| 124 | + aggregate_events = "on_failure", |
| 125 | + timeout_secs = 120 // ✅ Longer timeout |
| 126 | +)] |
| 127 | +async fn test_small_network_get_failure(ctx: &mut TestContext) -> TestResult { |
| 128 | + // Automatic setup, cleanup, and failure reporting |
| 129 | +} |
| 130 | +``` |
| 131 | + |
| 132 | +## Recommendations |
| 133 | + |
| 134 | +### 1. Re-enable the Test ✅ |
| 135 | +The gateway crash issue has been fixed (commit a283e23). The test should be re-enabled to verify: |
| 136 | +- Gateway stability under timeout conditions |
| 137 | +- PUT operation reliability with recent fixes |
| 138 | + |
| 139 | +### 2. Increase Test Timeout |
| 140 | +```diff |
| 141 | +- match timeout(Duration::from_secs(30), client_node1.recv()).await { |
| 142 | ++ match timeout(Duration::from_secs(90), client_node1.recv()).await { |
| 143 | +``` |
| 144 | + |
| 145 | +**Rationale**: |
| 146 | +- Connection establishment can take 15+ seconds (test comment at line 161-166) |
| 147 | +- WASM compilation adds overhead on first execution |
| 148 | +- 90s provides buffer while staying under 120s test timeout |
| 149 | + |
| 150 | +### 3. Modernize Test Implementation |
| 151 | +Convert to `#[freenet_test]` macro: |
| 152 | +- Automatic node setup and cleanup |
| 153 | +- Built-in event aggregation on failure |
| 154 | +- Proper peer identification in logs |
| 155 | +- Better failure diagnostics |
| 156 | + |
| 157 | +Example: |
| 158 | +```rust |
| 159 | +#[freenet_test( |
| 160 | + nodes = ["gateway", "node1", "node2"], |
| 161 | + auto_connect_peers = true, |
| 162 | + aggregate_events = "on_failure", |
| 163 | + timeout_secs = 180, |
| 164 | + startup_wait_secs = 15 |
| 165 | +)] |
| 166 | +async fn test_small_network_get_failure(ctx: &mut TestContext) -> TestResult { |
| 167 | + let gateway = ctx.gateway()?; |
| 168 | + let node1 = &ctx.peers()[0]; |
| 169 | + let node2 = &ctx.peers()[1]; |
| 170 | + |
| 171 | + // Test logic with automatic event tracking |
| 172 | + Ok(()) |
| 173 | +} |
| 174 | +``` |
| 175 | + |
| 176 | +### 4. Add Enhanced Debugging |
| 177 | +If test still fails, enable event aggregation: |
| 178 | +```rust |
| 179 | +let aggregator = TestAggregatorBuilder::new() |
| 180 | + .add_node("gateway", gw_temp_dir.path().join("_EVENT_LOG_LOCAL")) |
| 181 | + .add_node("node1", node1_temp_dir.path().join("_EVENT_LOG_LOCAL")) |
| 182 | + .build() |
| 183 | + .await?; |
| 184 | + |
| 185 | +// Analyze PUT operation flow |
| 186 | +let flow = aggregator.get_transaction_flow(&tx_id).await?; |
| 187 | +``` |
| 188 | + |
| 189 | +## Testing Plan |
| 190 | + |
| 191 | +1. **Phase 1**: Re-enable test with minimal changes |
| 192 | + - Remove `#[ignore]` attribute |
| 193 | + - Increase timeout to 90s |
| 194 | + - Run in CI to verify fixes |
| 195 | + |
| 196 | +2. **Phase 2**: If issues persist |
| 197 | + - Add event aggregation |
| 198 | + - Analyze PUT operation flow across nodes |
| 199 | + - Identify specific failure points |
| 200 | + |
| 201 | +3. **Phase 3**: Modernize implementation |
| 202 | + - Convert to `#[freenet_test]` macro |
| 203 | + - Add comprehensive logging |
| 204 | + - Document expected timing characteristics |
| 205 | + |
| 206 | +## Timeline Analysis |
| 207 | + |
| 208 | +Recent commits addressing these issues: |
| 209 | +- **Nov 1, 2025**: a283e23 - Fixed gateway crash (timeout notification panics) |
| 210 | +- **Oct 30, 2025**: 615f02d - Fixed PUT response routing |
| 211 | +- **Oct 28, 2025**: 5734a33 - Fixed PUT local caching |
| 212 | +- **Oct 25, 2025**: a34470b - Added transaction atomicity |
| 213 | + |
| 214 | +**All critical fixes are recent (within last week)**, suggesting the test should be re-enabled to verify the fixes work as intended. |
| 215 | + |
| 216 | +## Conclusion |
| 217 | + |
| 218 | +**The test should be re-enabled.** The primary issues (gateway crashes and PUT operation reliability) have been addressed by recent commits. The test needs: |
| 219 | +1. Timeout increase (30s → 90s) |
| 220 | +2. Modern test implementation (`#[freenet_test]` macro) |
| 221 | +3. Event aggregation for future debugging |
| 222 | + |
| 223 | +The current #[ignore] status prevents validation that these fixes actually resolve the production issues the test was designed to catch. |
| 224 | + |
| 225 | +--- |
| 226 | + |
| 227 | +**Investigation Date**: November 4, 2025 |
| 228 | +**Investigated By**: Claude (Session ID: 011CUoLcvxFtEgkZkhCQQ4Zn) |
| 229 | +**Related Issues**: #2043 (timeout panics), #2036 (PUT routing), #2011 (PUT caching) |
0 commit comments