Skip to content

Cadence Shard Manager #6862

@demirkayaender

Description

@demirkayaender

Cadence has been using consistent hashing based host-shard management. However this logic comes up with significant drawbacks which come to surface at high scale. Issues include:

  • Shard Load Skew
  • Shard Data Payload Skew
  • Heterogeneous Host SKUs
  • Shared Hosts
  • Individual Host Problems
  • Shutdowns, Restarts, Host Turn-up

and the list goes on.

With consistent hashing, shards are pinned to hosts. When a host is having a problem due to the factors listed above, adding a new host doesn't guarantee taking load from the problematic host. Similarly, removing a host might just shift the issues to the next host. Even though Cadence uses and recommends using many hosts in a cluster (~16K), these issues would still happen.

A host, as long as it reports itself as healthy to the membership protocol, could get traffic even if it was shutting down or just starting up.

With a high frequency, high availability environments, this logic poses an availability risk. Cadence will implement a load aware and smarter Shard Manager that can move shards among hosts to uniformly distribute the traffic and gracefully handle host maintenance events.

Metadata

Metadata

Assignees

No one assigned

    Labels

    improvementIncremental improvement for existing features

    Type

    No type

    Projects

    Status

    In progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions