Description
First, I want to express my appreciation for Socket.IO. It has been instrumental in helping us build real-time features and has served us well for four years. The library's ease of use and feature set helped us get to market quickly.
The Issue
After running Socket.IO in production with connections that need to stay alive for hours or days (not just minutes), we've encountered systematic false disconnections that are impacting our business operations.
Environment:
- Socket.IO client: 3.x and 4.x
- Multiple deployment scenarios tested (including Web Workers)
- Stable network conditions verified through parallel WebSocket connections
Observed Behavior:
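Our server logs show a recurring pattern of disconnect/reconnect cycles every few hours, with reconnections completing within 2-4 seconds. In the excerpt below (newest entry first), the cycles are roughly an hour apart and each reconnection completes within 1-2 seconds: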
10:54:17 - reconnected
10:54:16 - disconnected
09:44:13 - reconnected
09:44:11 - disconnected
08:41:12 - reconnected
08:41:10 - disconnected
These are false positives - the network connection remains stable throughout (verified by running parallel native WebSocket connections that maintain 99.9% uptime).
Critical Requirement: Connection Reliability
For our use case, connection reliability is paramount. We need connections that are only terminated by:
- Actual network failures
- Server-side decisions
- Explicit client disconnection
Connections should NEVER be disrupted by:
- Frontend code behavior (tab backgrounding, CPU load)
- Timer inaccuracies
- Browser throttling policies
- JavaScript event loop delays
In our domain, a false disconnection is worse than a delayed heartbeat. We need to trust that when a disconnection event fires, it represents a real connectivity issue that requires intervention.
Root Cause Analysis
The issue appears to stem from the heartbeat mechanism's reliance on setTimeout, which becomes unreliable in long-running applications due to:
- Browser throttling (even in Web Workers)
- JavaScript event loop delays under load
- System-level timer coalescing
- Accumulated timer drift over time
The client-side code is making decisions about connection health that should only be made based on actual network conditions.
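As a rough illustration of the failure mode described above, the sketch below measures how late a timer callback actually fired and refuses to treat a timeout as meaningful when the timer itself was badly delayed. The names `timerLateness` and `shouldTrustTimeout`, and the 1000 ms tolerance, are hypothetical choices for this example; they are not part of Socket.IO.

```javascript
// Hypothetical helper: how many milliseconds after its due time did a timer fire?
// All arguments are timestamps/durations in ms from a monotonic clock.
function timerLateness(scheduledAtMs, delayMs, firedAtMs) {
  return Math.max(0, firedAtMs - (scheduledAtMs + delayMs));
}

// Hypothetical policy: if the timer fired far later than requested, the page
// was probably throttled or busy, so an expired heartbeat deadline says little
// about the state of the network. The tolerance value is an assumption.
function shouldTrustTimeout(latenessMs, toleranceMs = 1000) {
  return latenessMs <= toleranceMs;
}

// Usage: measure the lateness of a real timer with performance.now().
const delayMs = 50;
const scheduledAt = performance.now();
setTimeout(() => {
  const late = timerLateness(scheduledAt, delayMs, performance.now());
  console.log(
    `timer fired ${late.toFixed(1)} ms late; trust timeout: ${shouldTrustTimeout(late)}`
  );
}, delayMs);
```

A timer that fires a minute late on a throttled page would yield `shouldTrustTimeout(...) === false`, which a client could use to re-check the connection instead of closing it.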
Validation Through Native WebSocket Implementation
To confirm this analysis, we implemented a native WebSocket wrapper that:
- Maintains Socket.IO protocol compatibility by implementing the message format:
  <packet type>[<# of binary attachments>-][<namespace>,][<acknowledgment id>][JSON-stringified payload without binary]
- Uses RTT-based adaptive timeouts instead of fixed timers:
  // Timeout dynamically adjusted based on actual network conditions
  const timeout = baseTimeout * (1.5 + (measuredRTT / 1000));
- Implements heartbeat without relying on setTimeout accuracy:
  - Monitors actual message activity
  - Uses performance.now() for drift-free time measurement
  - Validates connection through actual ping/pong exchange
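For readers who want to see what the message format listed above looks like in practice, here is a minimal sketch of an encoder and decoder for it. It is not our production wrapper and not Socket.IO's own parser; `encodePacket` and `decodePacket` are hypothetical names, error handling is minimal, and binary attachments themselves are not handled (only the attachment count is read).

```javascript
// Packet type numbers as defined by the Socket.IO protocol.
const PacketType = {
  CONNECT: 0, DISCONNECT: 1, EVENT: 2, ACK: 3,
  CONNECT_ERROR: 4, BINARY_EVENT: 5, BINARY_ACK: 6,
};

// Builds: <type>[<attachments>-][<namespace>,][<ack id>][JSON payload]
function encodePacket({ type, nsp = "/", id, attachments, data }) {
  let out = String(type);
  if (attachments !== undefined) out += attachments + "-";
  if (nsp !== "/") out += nsp + ","; // the default namespace is omitted
  if (id !== undefined) out += id;
  if (data !== undefined) out += JSON.stringify(data);
  return out;
}

function decodePacket(str) {
  if (!/^[0-6]/.test(str)) throw new Error("unknown packet type");
  const type = Number(str.charAt(0));
  const packet = { type, nsp: "/" };
  let i = 1;

  // The attachment count is only present on binary packet types.
  if (type === PacketType.BINARY_EVENT || type === PacketType.BINARY_ACK) {
    const dash = str.indexOf("-", i);
    if (dash === -1) throw new Error("missing attachment count");
    packet.attachments = Number(str.slice(i, dash));
    i = dash + 1;
  }

  // A namespace starts with "/" and ends at the next comma.
  if (str.charAt(i) === "/") {
    const comma = str.indexOf(",", i);
    const end = comma === -1 ? str.length : comma;
    packet.nsp = str.slice(i, end);
    i = comma === -1 ? str.length : comma + 1;
  }

  // An acknowledgment id is a run of digits before the payload.
  const digits = /^\d+/.exec(str.slice(i));
  if (digits) {
    packet.id = Number(digits[0]);
    i += digits[0].length;
  }

  // Whatever remains is the JSON-stringified payload.
  if (i < str.length) packet.data = JSON.parse(str.slice(i));
  return packet;
}

// Usage
console.log(encodePacket({ type: PacketType.EVENT, data: ["hello", 1] }));
console.log(decodePacket('2/admin,12["project:delete",123]'));
```

The first call produces `2["hello",1]`; the second yields an EVENT packet in the `/admin` namespace with acknowledgment id 12.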
Results:
- Native WebSocket wrapper: 99.9% uptime, 0 false disconnections
- Socket.IO client: ~67% perceived uptime due to false disconnections
- Both running on identical network conditions, same server
This confirms that the disconnections are not network-related but rather caused by the client-side timeout mechanism.
Business Impact
For applications requiring persistent connections (monitoring dashboards, trading platforms, real-time collaboration tools), these false disconnections cause:
- Message loss during reconnection windows
- State synchronization overhead
- Unnecessary server resource usage
- User experience degradation
- False alerts in monitoring systems
- Erosion of trust in connection status
Suggestion
While Socket.IO works excellently for short-to-medium duration connections (< 20 minutes), applications requiring long-lived, stable connections might benefit from:
- Alternative heartbeat mechanisms that don't rely solely on setTimeout
- RTT-based adaptive timeouts that adjust to actual network conditions
- Option to disable client-side timeout detection entirely, relying only on server/network signals
- Activity-based connection monitoring instead of timer-based
- Documentation guidance on when native WebSockets might be more appropriate
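To make the RTT-based and activity-based suggestions above more concrete, here is a minimal sketch of a connection monitor. It reuses the timeout formula from our wrapper (`baseTimeout * (1.5 + measuredRTT / 1000)`); everything else, including the `ActivityMonitor` class, its method names, the injectable clock, and the choice to use only the latest RTT sample, is a simplifying assumption for this example rather than an existing Socket.IO API.

```javascript
// Hypothetical monitor: decides staleness from elapsed monotonic time since
// the last observed message, not from whether a timer happened to fire.
class ActivityMonitor {
  // `now` is injectable so the logic can be exercised with a fake clock;
  // by default it reads the monotonic performance.now() clock.
  constructor(baseTimeoutMs, now = () => performance.now()) {
    this.baseTimeoutMs = baseTimeoutMs;
    this.now = now;
    this.rttMs = 0;          // latest measured round-trip time
    this.pingSentAt = null;  // timestamp of the outstanding ping, if any
    this.lastActivityAt = now();
  }

  // Call for every incoming message: any traffic is proof of life.
  recordActivity() {
    this.lastActivityAt = this.now();
  }

  pingSent() {
    this.pingSentAt = this.now();
  }

  // A pong both updates the RTT sample and counts as activity.
  pongReceived() {
    if (this.pingSentAt !== null) {
      this.rttMs = this.now() - this.pingSentAt;
      this.pingSentAt = null;
    }
    this.recordActivity();
  }

  // Timeout grows with the measured RTT (formula from the wrapper above).
  timeoutMs() {
    return this.baseTimeoutMs * (1.5 + this.rttMs / 1000);
  }

  isStale() {
    return this.now() - this.lastActivityAt > this.timeoutMs();
  }
}

// Usage with a fake clock, to show the arithmetic.
let fakeNow = 0;
const monitor = new ActivityMonitor(20000, () => fakeNow);
monitor.pingSent();
fakeNow = 500;              // pong arrives 500 ms later
monitor.pongReceived();
console.log(monitor.timeoutMs()); // 20000 * (1.5 + 0.5)
fakeNow = 30000;
console.log(monitor.isStale());   // 29500 ms of silence vs. a 40000 ms timeout
```

`isStale()` can be called from any periodic trigger. A production version would likely smooth the RTT over several samples and send a fresh ping to confirm before closing the connection; neither is shown here.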
Our Path Forward
After extensive testing, we've migrated to native WebSocket implementations for our long-lived connections while continuing to use Socket.IO for shorter-duration features. With native WebSockets and RTT-based monitoring, we achieve true connection reliability - disconnections only occur due to real network or server events, never due to client-side timer issues.
This isn't a criticism of Socket.IO, but rather a recognition that different use cases require different tools.
Appreciation
Socket.IO has been a crucial part of our journey. This issue is shared in the spirit of helping others who might face similar challenges with long-lived connections, and to provide feedback that might help shape the library's future direction.
Thank you for all the work you've put into this library. It has helped countless developers, myself included, build real-time applications.