Skip to content

Incorrect state calculation on CSAPI endpoints #18936

@kegsay

Description

@kegsay

Description

There are several CSAPI endpoints like GET /_matrix/client/v3/rooms/{roomId}/members which accepts an ?at= param which:

The point in time (pagination token) to return members for in the room. This token can be obtained from a prev_batch token returned for each room by the sync API. Defaults to the current state of the room, as determined by the server.

These are almost always incorrectly implemented in Synapse. They are usually implemented as:

last_event_id = (
    await self.store.get_last_event_id_in_room_before_stream_ordering(
        room_id,
        end_token=at_token.room_key,
    )
)

room_state_events = (
    await self._state_storage_controller.get_state_for_events(
        [last_event_id], state_filter=state_filter
    )
)

Steps to reproduce

This pattern neglects to consider concurrent events in the DAG, making it inappropriate for use to determine what the member list was like at a certain event. As an example:

   A join
    |
   B join
    |     \ 
   C join  D join

Assume events arrived in alphabetical order. What should /members return if you specify ?at=D?
Currently, Synapse and I suspect other implementations simply calculate the room state after the event
and return that, which would be {A,B,D}. In practice however, users live streaming when these events
arrived would have seen {A,B,C,D} as the member list after D. This subverts the entire point of the
?at= parameter, which is to try to figure out the member list at a particular historical point in time.

In the worst case, this is a footgun as clients may be thinking that by using ?at= they ensure that
they are protected against race conditions when querying the CSAPI, but in practice they are accidentally
asking for subtly different dataset.

We should always be using the current_state_delta_stream to ensure we handle concurrent events.
We should remove get_last_event_id_in_room_before_stream_ordering as it always (?) is the wrong thing to do. Even callers which do use it have FIXMEs to say it's wrong for the reasons I have outlined:

        # FIXME: This gets the state at the latest event before the stream ordering,
        # which might not be the same as the "current state" of the room at the time
        # of the stream token if there were multiple forward extremities at the time.

The net result of failing to consider this is:

  • We may incorrectly include state we should have excluded (if we do not consider a concurrent revocation e.g a change in PLs that invalidates some state)
  • We may incorrectly exclude state we should have included (if we do not considder a concurrent addition e.g a room member joining)

Affected endpoints

  • GET /_matrix/client/v3/rooms/{roomId}/members
  • /sync when used without ?state_after=
  • All endpoints which fetch the room state for left/banned users (the leave/ban event is used solely to determine room state at the time), including /sync, sliding sync, /state endpoints
  • Sliding sync: calculation of room name, historical rooms

Synapse Version

v1.138.0

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions