
Long-lived connections experience false disconnections due to timer reliability issues #5357

@matiaslopezd

First, I want to express my appreciation for Socket.IO. It has been instrumental in helping us build real-time features and has served us well for four years. The library's ease of use and feature set helped us get to market quickly.

The Issue

After running Socket.IO in production with connections that need to stay alive for hours or days (not just minutes), we've encountered systematic false disconnections that are impacting our business operations.

Environment:

  • Socket.IO client: 3.x and 4.x
  • Multiple deployment scenarios tested (including Web Workers)
  • Stable network conditions verified through parallel WebSocket connections

Observed Behavior:
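Our server logs show a recurring pattern of disconnection/reconnection cycles every few hours, with reconnection following within 2-4 seconds. The excerpt below is listed newest first; in it the cycles are roughly an hour apart and reconnection takes 1-2 seconds: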

10:54:17 - reconnected
10:54:16 - disconnected 
09:44:13 - reconnected
09:44:11 - disconnected
08:41:12 - reconnected
08:41:10 - disconnected

These are false positives: the network connection remains stable throughout, as verified by parallel native WebSocket connections that maintain 99.9% uptime.

Critical Requirement: Connection Reliability

For our use case, connection reliability is paramount. We need connections that are only terminated by:

  • Actual network failures
  • Server-side decisions
  • Explicit client disconnection

Connections should NEVER be disrupted by:

  • Frontend code behavior (tab backgrounding, CPU load)
  • Timer inaccuracies
  • Browser throttling policies
  • JavaScript event loop delays

In our domain, a false disconnection is worse than a delayed heartbeat. We need to trust that when a disconnection event fires, it represents a real connectivity issue that requires intervention.
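To make that distinction concrete in logs, a small helper can sort the reason string passed to the client's "disconnect" event into the categories listed above. This is only a sketch: the reason strings are the ones the Socket.IO client documents, but `classifyDisconnect` and the grouping itself are our own assumptions for illustration, not library behavior.

```javascript
// Reason strings delivered with the Socket.IO client's "disconnect" event.
// The grouping into categories is an assumption made for this example.
const EXPLICIT_REASONS = new Set(["io server disconnect", "io client disconnect"]);
const TRANSPORT_REASONS = new Set(["transport close", "transport error"]);

// classifyDisconnect is a hypothetical helper name, not part of Socket.IO.
function classifyDisconnect(reason) {
  if (EXPLICIT_REASONS.has(reason)) return "explicit";   // server or client asked for it
  if (TRANSPORT_REASONS.has(reason)) return "transport"; // underlying connection closed or errored
  if (reason === "ping timeout") return "heartbeat";     // decided by the client-side heartbeat timer
  return "unknown";
}

// Usage with a connected client (not executed here):
//   socket.on("disconnect", (reason) => {
//     console.log(new Date().toISOString(), reason, classifyDisconnect(reason));
//   });
```

Logging the category next to the timestamp makes it possible to check whether the hourly cycles are dominated by the "heartbeat" category rather than by transport-level closes.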

Root Cause Analysis

The issue appears to stem from the heartbeat mechanism's reliance on setTimeout, which becomes unreliable in long-running applications due to:

  1. Browser throttling (even in Web Workers)
  2. JavaScript event loop delays under load
  3. System-level timer coalescing
  4. Accumulated timer drift over time

The client-side code is making decisions about connection health that should only be made based on actual network conditions.
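For anyone who wants to check whether timers run late in their own environment, a minimal sketch like the following measures how long after its due time a setTimeout callback actually fires. The function names are hypothetical, and this only measures timer lateness; it does not by itself prove that lateness causes a given disconnection.

```javascript
// Lateness of a timer callback: how long after its due time it actually ran.
function timerLateness(scheduledAtMs, delayMs, firedAtMs) {
  return firedAtMs - (scheduledAtMs + delayMs);
}

// Schedule one timer and report its lateness through a callback.
// performance.now() is used because it is a monotonic clock.
function measureLateness(delayMs, onResult) {
  const scheduledAt = performance.now();
  setTimeout(() => {
    onResult(timerLateness(scheduledAt, delayMs, performance.now()));
  }, delayMs);
}

// Example (not executed here):
//   measureLateness(1000, (ms) => console.log(`timer ran ${ms.toFixed(1)} ms late`));
```

Running this periodically in a backgrounded tab or under CPU load and recording the largest lateness gives a rough idea of how far a heartbeat timer could be pushed past its deadline.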

Validation Through Native WebSocket Implementation

To confirm this analysis, we implemented a native WebSocket wrapper that:

  1. Maintains Socket.IO protocol compatibility by implementing the message format:

    <packet type>[<# of binary attachments>-][<namespace>,][<acknowledgment id>][JSON-stringified payload without binary]
    
  2. Uses RTT-based adaptive timeouts instead of fixed timers:

    // Timeout dynamically adjusted based on actual network conditions
    const timeout = baseTimeout * (1.5 + (measuredRTT / 1000));

  3. Implements heartbeat without relying on setTimeout accuracy:

    • Monitors actual message activity
    • Uses performance.now() for monotonic time measurement that is unaffected by system clock changes
    • Validates connection through actual ping/pong exchange
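As an illustration of the message format in item 1, here is a minimal encoder for non-binary packets. It is a sketch rather than our production wrapper: `encodePacket` and its option names are hypothetical, the packet type numbers follow the Socket.IO protocol description, and binary payload handling is left out.

```javascript
// Socket.IO packet types as numbered in the protocol description.
const PacketType = {
  CONNECT: 0, DISCONNECT: 1, EVENT: 2, ACK: 3,
  CONNECT_ERROR: 4, BINARY_EVENT: 5, BINARY_ACK: 6,
};

// encodePacket is a hypothetical helper. Binary payloads are out of scope,
// so the "<# of binary attachments>-" segment is only emitted when a count is given.
function encodePacket({ type, nsp = "/", id, data, attachments }) {
  let out = String(type);
  if (attachments !== undefined) out += `${attachments}-`;
  if (nsp !== "/") out += `${nsp},`;          // the main namespace is omitted
  if (id !== undefined) out += String(id);    // acknowledgment id, if any
  if (data !== undefined) out += JSON.stringify(data);
  return out;
}

console.log(encodePacket({ type: PacketType.EVENT, data: ["hello", 1] }));
// 2["hello",1]
console.log(encodePacket({ type: PacketType.EVENT, nsp: "/admin", id: 456, data: ["project:delete", 123] }));
// 2/admin,456["project:delete",123]
```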

Results:

  • Native WebSocket wrapper: 99.9% uptime, 0 false disconnections
  • Socket.IO client: ~67% perceived uptime due to false disconnections
  • Both running on identical network conditions, same server

This confirms that the disconnections are not network-related but rather caused by the client-side timeout mechanism.

Business Impact

For applications requiring persistent connections (monitoring dashboards, trading platforms, real-time collaboration tools), these false disconnections cause:

  • Message loss during reconnection windows
  • State synchronization overhead
  • Unnecessary server resource usage
  • User experience degradation
  • False alerts in monitoring systems
  • Erosion of trust in connection status

Suggestion

While Socket.IO works excellently for short-to-medium duration connections (< 20 minutes), applications requiring long-lived, stable connections might benefit from:

  1. Alternative heartbeat mechanisms that don't rely solely on setTimeout
  2. RTT-based adaptive timeouts that adjust to actual network conditions
  3. Option to disable client-side timeout detection entirely, relying only on server/network signals
  4. Activity-based connection monitoring instead of timer-based
  5. Documentation guidance on when native WebSockets might be more appropriate
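To make suggestions 1, 2 and 4 more concrete, here is a rough sketch of an activity-based monitor with an RTT-adjusted limit. Everything in it is an assumption for illustration: `ActivityMonitor`, its option names and the "twice the check interval" rule are hypothetical, and the only piece taken from our wrapper is the timeout formula quoted earlier (with the RTT assumed to be in milliseconds). It is not how Socket.IO works today.

```javascript
// Timeout formula quoted from the issue; measuredRttMs is assumed to be in milliseconds.
function adaptiveTimeout(baseTimeoutMs, measuredRttMs) {
  return baseTimeoutMs * (1.5 + measuredRttMs / 1000);
}

// ActivityMonitor is a hypothetical class. It never declares a timeout just
// because a timer fired: check() compares monotonic timestamps, and a check
// that itself ran late (throttled tab, busy event loop) reports "suspended".
class ActivityMonitor {
  constructor({ baseTimeoutMs, checkIntervalMs, now = () => performance.now() }) {
    this.baseTimeoutMs = baseTimeoutMs;
    this.checkIntervalMs = checkIntervalMs;
    this.now = now;
    this.rttMs = 0;
    this.lastActivity = now();
    this.lastCheck = now();
  }

  // Call for every inbound message, including pings and pongs.
  recordActivity() {
    this.lastActivity = this.now();
  }

  // Call with the measured round trip of a ping/pong exchange.
  recordRtt(rttMs) {
    this.rttMs = rttMs;
    this.recordActivity();
  }

  // Returns "alive", "suspended", or "timed-out".
  check() {
    const t = this.now();
    const sinceLastCheck = t - this.lastCheck;
    this.lastCheck = t;
    if (sinceLastCheck > 2 * this.checkIntervalMs) {
      // The checker itself was delayed, so give the connection a fresh window.
      this.lastActivity = t;
      return "suspended";
    }
    const limit = adaptiveTimeout(this.baseTimeoutMs, this.rttMs);
    return t - this.lastActivity > limit ? "timed-out" : "alive";
  }
}
```

In a browser this would typically be driven by `setInterval(() => monitor.check(), checkIntervalMs)`. The point of the design is that a late interval tick is detected and treated as "we were not running" rather than "the server went silent", so a "timed-out" result can be followed by an explicit ping before the connection is closed.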

Our Path Forward

After extensive testing, we've migrated to native WebSocket implementations for our long-lived connections while continuing to use Socket.IO for shorter-duration features. With native WebSockets and RTT-based monitoring, we achieve true connection reliability - disconnections only occur due to real network or server events, never due to client-side timer issues.

This isn't a criticism of Socket.IO, but rather a recognition that different use cases require different tools.

Appreciation

Socket.IO has been a crucial part of our journey. This issue is shared in the spirit of helping others who might face similar challenges with long-lived connections, and to provide feedback that might help shape the library's future direction.

Thank you for all the work you've put into this library. It has helped countless developers, myself included, build real-time applications.
