Skip to content

Commit e789eea

Browse files
committed
docs: add comprehensive investigation report for issue #2023
- Analyzed test_small_network_get_failure failure modes - Identified gateway crash root cause (fixed in a283e23) - Documented PUT operation timeout issues - Provided recommendations for re-enabling the test - Suggested modernization using #[freenet_test] macro Related: #2023, #2043, #2036, #2011
1 parent a7bfd06 commit e789eea

File tree

1 file changed

+229
-0
lines changed

1 file changed

+229
-0
lines changed

INVESTIGATION_REPORT_ISSUE_2023.md

Lines changed: 229 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,229 @@
1+
# Investigation Report: Issue #2023 - test_small_network_get_failure
2+
3+
## Issue Overview
4+
Test `test_small_network_get_failure` is currently ignored due to:
5+
1. PUT operations timing out (exceeding 30-second threshold)
6+
2. Gateway process crashes during or after PUT operations
7+
8+
**Test Location**: `apps/freenet-ping/app/tests/test_small_network_get_issue.rs:24`
9+
10+
## Investigation Findings
11+
12+
### 1. Test Configuration Analysis
13+
- **Network Topology**: Star topology with 1 gateway + 3 nodes
14+
- All nodes connect only through the gateway (poor connectivity by design)
15+
- Simulates production conditions with limited peer connections
16+
- **Timeouts**:
17+
- Test client timeout: **30 seconds** (line 221)
18+
- Internal operation TTL: **60 seconds** (`OPERATION_TTL` in `config/mod.rs:40`)
19+
- Cleanup interval: **5 seconds**
20+
- **Problem**: Test times out at 30s before internal operation timeout at 60s
21+
22+
### 2. Gateway Crash Root Cause (FIXED ✅)
23+
24+
**Commit a283e23**: "fix: guard op-state timeout notifications"
25+
26+
**Problem**:
27+
```rust
28+
// OLD CODE - Would panic if receiver dropped
29+
event_loop_notifier.notifications_sender
30+
.send(Either::Right(NodeEvent::TransactionTimedOut(tx)))
31+
.await
32+
.unwrap(); // ❌ PANIC if send fails
33+
```
34+
35+
**Solution**:
36+
```rust
37+
// NEW CODE - Handles errors gracefully
38+
async fn notify_transaction_timeout(
39+
event_loop_notifier: &EventLoopNotificationsSender,
40+
tx: Transaction,
41+
) -> bool {
42+
match event_loop_notifier.notifications_sender.send(...).await {
43+
Ok(()) => true,
44+
Err(err) => {
45+
tracing::warn!("Failed to notify; receiver likely dropped");
46+
false // ✅ No panic
47+
}
48+
}
49+
}
50+
```
51+
52+
**Impact**: This was the root cause of gateway crashes during timeout scenarios. The gateway would panic when trying to send timeout notifications after the receiver channel was dropped.
53+
54+
### 3. PUT Operation Issues (PARTIALLY FIXED ⚠️)
55+
56+
#### Recent Fixes:
57+
1. **Commit 615f02d**: "fix: route successful PUT responses back through forwarding peers"
58+
- Fixed routing of SuccessfulPut messages
59+
- Ensures responses reach the originating node
60+
61+
2. **Commit 5734a33**: "fix: cache contract state locally before forwarding client-initiated PUT"
62+
- Ensures publishing node caches state immediately
63+
- Improves reliability of PUT operations
64+
65+
3. **Commit a34470b**: "feat: implement transaction atomicity with parent-child relationship"
66+
- Added proper parent-child transaction tracking
67+
- Improves composite operation reliability
68+
69+
#### Remaining Concerns:
70+
The PUT operation code (`crates/core/src/operations/put.rs`) contains multiple `unwrap()` and `expect()` calls:
71+
72+
```rust
73+
// Lines with potential panics:
74+
361: peer = %op_manager.ring.connection_manager.get_peer_key().unwrap()
75+
553: let peer = broadcast_to.get(peer_num).unwrap()
76+
596: peer = %op_manager.ring.connection_manager.get_peer_key().unwrap()
77+
...and more
78+
```
79+
80+
Most are for:
81+
- `get_peer_key().unwrap()` - Should be safe if peer is initialized
82+
- `location.expect("infallible")` - Documented as infallible
83+
- `target.unwrap()` - After checking `is_some()`
84+
85+
### 4. Timeout Configuration Mismatch
86+
87+
**Problem**: Test timeout (30s) < Operation TTL (60s)
88+
89+
```rust
90+
// In test (line 221)
91+
match timeout(Duration::from_secs(30), client_node1.recv()).await {
92+
...
93+
Err(_) => {
94+
println!("Timeout waiting for put response"); // ⏰ 30 seconds
95+
}
96+
}
97+
98+
// In config (crates/core/src/config/mod.rs:40)
99+
pub(crate) const OPERATION_TTL: Duration = Duration::from_secs(60); // ⏰ 60 seconds
100+
```
101+
102+
**Impact**: Test fails before internal cleanup can occur, making it hard to distinguish between:
103+
- Actual operation failures
104+
- Slow network propagation
105+
- Connection establishment delays
106+
107+
### 5. Test Quality Issues
108+
109+
The test uses deprecated patterns:
110+
```rust
111+
// ❌ Deprecated logging
112+
freenet::config::set_logger(Some(LevelFilter::DEBUG), None);
113+
114+
// ❌ Manual node setup (50+ lines of boilerplate)
115+
// ❌ No event aggregation for debugging
116+
// ❌ No peer identification in logs
117+
```
118+
119+
**Should use**:
120+
```rust
121+
#[freenet_test(
122+
nodes = ["gateway", "node1", "node2"],
123+
auto_connect_peers = true,
124+
aggregate_events = "on_failure",
125+
timeout_secs = 120 // ✅ Longer timeout
126+
)]
127+
async fn test_small_network_get_failure(ctx: &mut TestContext) -> TestResult {
128+
// Automatic setup, cleanup, and failure reporting
129+
}
130+
```
131+
132+
## Recommendations
133+
134+
### 1. Re-enable the Test ✅
135+
The gateway crash issue has been fixed (commit a283e23). The test should be re-enabled to verify:
136+
- Gateway stability under timeout conditions
137+
- PUT operation reliability with recent fixes
138+
139+
### 2. Increase Test Timeout
140+
```diff
141+
- match timeout(Duration::from_secs(30), client_node1.recv()).await {
142+
+ match timeout(Duration::from_secs(90), client_node1.recv()).await {
143+
```
144+
145+
**Rationale**:
146+
- Connection establishment can take 15+ seconds (test comment at line 161-166)
147+
- WASM compilation adds overhead on first execution
148+
- 90s provides buffer while staying under 120s test timeout
149+
150+
### 3. Modernize Test Implementation
151+
Convert to `#[freenet_test]` macro:
152+
- Automatic node setup and cleanup
153+
- Built-in event aggregation on failure
154+
- Proper peer identification in logs
155+
- Better failure diagnostics
156+
157+
Example:
158+
```rust
159+
#[freenet_test(
160+
nodes = ["gateway", "node1", "node2"],
161+
auto_connect_peers = true,
162+
aggregate_events = "on_failure",
163+
timeout_secs = 180,
164+
startup_wait_secs = 15
165+
)]
166+
async fn test_small_network_get_failure(ctx: &mut TestContext) -> TestResult {
167+
let gateway = ctx.gateway()?;
168+
let node1 = &ctx.peers()[0];
169+
let node2 = &ctx.peers()[1];
170+
171+
// Test logic with automatic event tracking
172+
Ok(())
173+
}
174+
```
175+
176+
### 4. Add Enhanced Debugging
177+
If test still fails, enable event aggregation:
178+
```rust
179+
let aggregator = TestAggregatorBuilder::new()
180+
.add_node("gateway", gw_temp_dir.path().join("_EVENT_LOG_LOCAL"))
181+
.add_node("node1", node1_temp_dir.path().join("_EVENT_LOG_LOCAL"))
182+
.build()
183+
.await?;
184+
185+
// Analyze PUT operation flow
186+
let flow = aggregator.get_transaction_flow(&tx_id).await?;
187+
```
188+
189+
## Testing Plan
190+
191+
1. **Phase 1**: Re-enable test with minimal changes
192+
- Remove `#[ignore]` attribute
193+
- Increase timeout to 90s
194+
- Run in CI to verify fixes
195+
196+
2. **Phase 2**: If issues persist
197+
- Add event aggregation
198+
- Analyze PUT operation flow across nodes
199+
- Identify specific failure points
200+
201+
3. **Phase 3**: Modernize implementation
202+
- Convert to `#[freenet_test]` macro
203+
- Add comprehensive logging
204+
- Document expected timing characteristics
205+
206+
## Timeline Analysis
207+
208+
Recent commits addressing these issues:
209+
- **Nov 1, 2025**: a283e23 - Fixed gateway crash (timeout notification panics)
210+
- **Oct 30, 2025**: 615f02d - Fixed PUT response routing
211+
- **Oct 28, 2025**: 5734a33 - Fixed PUT local caching
212+
- **Oct 25, 2025**: a34470b - Added transaction atomicity
213+
214+
**All critical fixes are recent (within last week)**, suggesting the test should be re-enabled to verify the fixes work as intended.
215+
216+
## Conclusion
217+
218+
**The test should be re-enabled.** The primary issues (gateway crashes and PUT operation reliability) have been addressed by recent commits. The test needs:
219+
1. Timeout increase (30s → 90s)
220+
2. Modern test implementation (`#[freenet_test]` macro)
221+
3. Event aggregation for future debugging
222+
223+
The current #[ignore] status prevents validation that these fixes actually resolve the production issues the test was designed to catch.
224+
225+
---
226+
227+
**Investigation Date**: November 4, 2025
228+
**Investigated By**: Claude (Session ID: 011CUoLcvxFtEgkZkhCQQ4Zn)
229+
**Related Issues**: #2043 (timeout panics), #2036 (PUT routing), #2011 (PUT caching)

0 commit comments

Comments
 (0)