@agners
Member

@agners agners commented Oct 8, 2025

Proposed change

This PR fixes a race condition in the WebSocket proxy that causes AssertionError: assert transport is not None crashes when clients disconnect during the WebSocket handshake.

The issue occurs when a client connection is lost in the window between the Home Assistant API state check and the server.prepare(request) call. When this happens, request.transport becomes None, causing aiohttp's internal WebSocket upgrade code to hit an assertion failure at web_ws.py:317.

Sentry shows ~7k events per day, coming from only ~90 users. Something on those systems seems to be constantly trying, and failing, to connect to the WebSocket API.

The fix adds a transport validity check before attempting the WebSocket upgrade. If the connection is already closed (request.transport is None), we log a warning and raise HTTPBadRequest with a clear reason instead of crashing with an unhandled AssertionError.
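
A minimal sketch of the check described above, written with stand-in classes so that it runs without aiohttp. `FakeRequest`, the local `HTTPBadRequest`, and `ensure_transport` are hypothetical names used only for illustration; the actual change lives in supervisor/api/proxy.py and raises aiohttp's `web.HTTPBadRequest`.

```python
import logging
from typing import Optional

_LOGGER = logging.getLogger(__name__)


class HTTPBadRequest(Exception):
    """Stand-in for aiohttp.web.HTTPBadRequest (assumption: illustration only)."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


class FakeRequest:
    """Hypothetical request object exposing only the `transport` attribute."""

    def __init__(self, transport: Optional[object]) -> None:
        self.transport = transport


def ensure_transport(request: FakeRequest) -> None:
    """Refuse the upgrade early if the client connection is already gone."""
    if request.transport is None:
        _LOGGER.warning("WebSocket connection lost before upgrade")
        raise HTTPBadRequest(reason="Connection closed")


# A closed connection is rejected before the upgrade would be attempted.
try:
    ensure_transport(FakeRequest(transport=None))
except HTTPBadRequest as err:
    rejected_reason = err.reason
```

A request whose transport is still set passes through `ensure_transport` without raising, so the upgrade would proceed as before.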

Stack trace before fix:

2025-10-08 13:41:18.229 ERROR (MainThread) [aiohttp.server] Error handling request from 172.30.32.1
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_protocol.py", line 510, in _handle_request
    resp = await request_handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_app.py", line 569, in _handle
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_middlewares.py", line 117, in impl
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/api/middleware/security.py", line 198, in block_bad_requests
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/api/middleware/security.py", line 208, in system_validation
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/api/middleware/security.py", line 225, in token_validation
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/api/middleware/security.py", line 289, in core_proxy
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/api/proxy.py", line 227, in websocket
    await server.prepare(request)
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_ws.py", line 214, in prepare
    protocol, writer = self._pre_start(request)
                       ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_ws.py", line 317, in _pre_start
    assert transport is not None
           ^^^^^^^^^^^^^^^^^^^^^
AssertionError

Behavior after fix:

  • Graceful HTTPBadRequest response with reason "Connection closed"
  • Warning logged: "WebSocket connection lost before upgrade"
  • No more crashes

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:
  • Link to client library pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Add a transport validity check before WebSocket upgrade to prevent
AssertionError when clients disconnect during handshake.

The issue occurs when a client connection is lost between the API state
check and server.prepare() call, causing request.transport to become None
and triggering "assert transport is not None" in aiohttp's _pre_start().

The fix detects the closed connection early and raises HTTPBadRequest
with a clear reason instead of crashing with an AssertionError.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@agners agners force-pushed the fix-websocket-upgrade-race branch from 07b7bba to 1cd499b Compare October 8, 2025 12:48
@agners
Member Author

agners commented Oct 8, 2025

Can be reproduced easily here using the following reproducer:

#!/usr/bin/env python3
"""Minimal reproducer for WebSocket transport None AssertionError.

This minimal reproducer focuses on the simplest way to trigger the race condition:
- Connect to /core/websocket
- Cancel the connection immediately after starting
- Repeat 20 times to hit the narrow timing window

The bug: request.transport becomes None between check_api_state() and server.prepare()
"""

import asyncio
import aiohttp
import os
import sys


async def main():
    """Minimal reproducer - just rapid connect/cancel cycles."""
    supervisor_host = os.getenv('SUPERVISOR', 'supervisor')
    supervisor_token = os.getenv('SUPERVISOR_TOKEN')

    if not supervisor_token:
        print("ERROR: Set SUPERVISOR_TOKEN environment variable")
        sys.exit(1)

    print(f"Connecting to: {supervisor_host}")
    print("Starting rapid connect/cancel test (20 attempts)...")
    print("Monitor logs: docker logs -f hassio_supervisor | grep -E 'AssertionError|transport|WebSocket'\n")

    headers = {'Authorization': f'Bearer {supervisor_token}'}

    for attempt in range(20):
        try:
            # Create a connection but close it immediately
            # This simulates network interruption during handshake
            async with aiohttp.ClientSession(headers=headers) as session:
                # Start WebSocket connection
                ws_task = asyncio.create_task(
                    session.ws_connect(
                        f"http://{supervisor_host}/core/websocket",
                        timeout=1.0,  # seconds; ws_connect() expects a float here (ClientWSTimeout on newer aiohttp), not ClientTimeout
                    )
                )

                # Give it a tiny moment to start, then cancel
                await asyncio.sleep(0.001)  # Very short delay
                ws_task.cancel()

                try:
                    await ws_task
                except asyncio.CancelledError:
                    pass

            print(f"  Attempt {attempt}: Connection cancelled")

        except Exception as e:
            print(f"  Attempt {attempt}: Error - {type(e).__name__}: {e}")

        # Small delay between attempts
        await asyncio.sleep(0.01)

    print("\n✅ Completed 20 attempts")
    print("\nCheck for AssertionError:")
    print("  docker logs hassio_supervisor | tail -100 | grep -A 5 AssertionError")
    print("\nOr check for fix working:")
    print("  docker logs hassio_supervisor | tail -100 | grep 'WebSocket connection lost'")


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\n\nInterrupted")

Contributor

@erwindouna erwindouna left a comment

Not to cause confusion, but a genuine question: wouldn't it be safer to also take WebSocketResponse.prepare() into consideration? That is where the handshake and the writes to the transport start, if I am correct.

On high-traffic systems, the connection might drop right between the request.transport check and the prepare() call. If my analysis is correct, it might be worth wrapping the call or adding a targeted test to confirm that it fails as gracefully as the request.transport check does.

@agners
Member Author

agners commented Oct 9, 2025

Not to cause confusion, but a genuine question: wouldn't it be safer to also take WebSocketResponse.prepare() into consideration? That is where the handshake and the writes to the transport start, if I am correct.

Hm, yeah, that is a good point. Because of the await, there is no guarantee that the prepare() coroutine runs right after the None check, which still leaves a race window: if the connection drops in that small window, the assertion can still fire.

What we could do is catch the assertion (it is probably fine to assume that any assertion raised here is connection related):

  try:
      await server.prepare(request)
  except AssertionError as err:
      # Handle transport becoming None during prepare()
      raise web.HTTPBadRequest(reason="Connection closed") from err
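
To make the combined approach concrete, here is a small simulation using only asyncio and stand-in classes. `FakeRequest`, `FakeWebSocketResponse`, the local `HTTPBadRequest`, and `upgrade` are hypothetical names, and the explicit `asyncio.sleep(0)` is an assumed suspension point between the check and the assertion; whether such a point exists in the real call path depends on aiohttp internals. The sketch only shows that, if the transport is dropped inside such a window, the early check passes and the except clause still produces a clean rejection.

```python
import asyncio
from typing import Optional


class HTTPBadRequest(Exception):
    """Stand-in for aiohttp.web.HTTPBadRequest (assumption: illustration only)."""


class FakeRequest:
    """Hypothetical request whose transport can be dropped by another task."""

    def __init__(self) -> None:
        self.transport: Optional[object] = object()


class FakeWebSocketResponse:
    """Hypothetical stand-in mimicking the assertion in aiohttp's _pre_start()."""

    async def prepare(self, request: FakeRequest) -> None:
        assert request.transport is not None


async def upgrade(request: FakeRequest, server: FakeWebSocketResponse) -> str:
    # Early check: catches connections that are already closed.
    if request.transport is None:
        raise HTTPBadRequest("Connection closed")

    # Assumed suspension point: other tasks may run here and drop the connection.
    await asyncio.sleep(0)

    try:
        await server.prepare(request)
    except AssertionError as err:
        # Transport became None after the early check.
        raise HTTPBadRequest("Connection closed") from err
    return "upgraded"


async def main() -> str:
    request = FakeRequest()
    server = FakeWebSocketResponse()

    async def drop_connection() -> None:
        request.transport = None

    # Scheduled now, runs as soon as upgrade() yields to the event loop.
    dropper = asyncio.create_task(drop_connection())
    try:
        return await upgrade(request, server)
    except HTTPBadRequest as err:
        return f"rejected: {err}"
    finally:
        await dropper


result = asyncio.run(main())
```

Note that this relies on assert statements being active; when Python runs with -O, assertions are stripped and the stand-in `prepare()` would no longer raise.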

I wonder whether using an assertion on the aiohttp side is really the right approach 🤔

@bdraco thoughts?

@erwindouna
Contributor

Thanks for taking it into consideration. Besides catching the assertion, an if server.prepare is None check before the try/except would help as a safeguard anyhow, but maybe that's nitpicking and overthinking...

@agners
Member Author

agners commented Oct 10, 2025

Thanks for taking it into consideration. Besides catching the assertion, an if server.prepare is None check before the try/except would help as a safeguard anyhow, but maybe that's nitpicking and overthinking...

Wouldn't if server.prepare is None always evaluate to False? That is a coroutine method of the WebSocketResponse object, so it should never be None. Did you mean to say if request.transport is None?
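
To illustrate the point about server.prepare: accessing a method on an instance returns a bound method object, never None, so an `is None` test on it cannot detect a closed connection. The class below is a hypothetical stand-in, not aiohttp's WebSocketResponse.

```python
class FakeWebSocketResponse:
    """Hypothetical stand-in for aiohttp's WebSocketResponse."""

    async def prepare(self, request: object) -> None:
        """Placeholder for the real handshake logic."""


server = FakeWebSocketResponse()

# Attribute access yields a bound method object, so this comparison is always False.
prepare_is_none = server.prepare is None
```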

@erwindouna
Contributor

erwindouna commented Oct 10, 2025

I've made a nasty typo whilst typing on the smartphone. You are correct. ;)
