System Design ‐ Core Principles
The ACID (Atomicity, Consistency, Isolation, Durability) model used in relational databases is often too strict for distributed NoSQL databases. The BASE principle (Basically Available, Soft state, Eventually consistent) offers more flexibility, choosing availability over consistency: replicas may temporarily disagree, but their states will eventually converge to a consistent value.
- High Availability: This means we need to ensure an agreed, high level of uptime. We often describe the design target as “3 nines” or “4 nines”. “4 nines”, 99.99% uptime, means the service can be down for at most 8.64 seconds per day.
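As a quick sanity check, these downtime budgets follow directly from the uptime percentage; a minimal sketch:

```python
# Downtime budget implied by an uptime target, per day and per year.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

for nines, uptime in [("3 nines", 0.999), ("4 nines", 0.9999)]:
    daily = (1 - uptime) * SECONDS_PER_DAY            # seconds per day
    yearly = (1 - uptime) * SECONDS_PER_YEAR / 60     # minutes per year
    print(f"{nines}: {daily:.2f} s/day, {yearly:.1f} min/year")
# 3 nines: 86.40 s/day, 525.6 min/year
# 4 nines: 8.64 s/day, 52.6 min/year
```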
To achieve high availability, we need to design redundancy in the system. There are several ways to do this:
- Hot-hot: two instances receive the same input and both send output to the downstream service. If one side goes down, the other can take over immediately. Since both sides send output downstream, the downstream system needs to deduplicate.
- Hot-warm: two instances receive the same input, but only the hot side sends output to the downstream service. If the hot side goes down, the warm side takes over and starts sending output to the downstream service.
- Single-leader cluster: one leader instance receives data from the upstream system and replicates it to the other replicas.
- Leaderless cluster: there is no leader in this type of cluster. Any write is replicated to several instances. As long as the number of replicas a write must reach (W) plus the number of replicas a read consults (R) is larger than the total number of replicas (N), i.e. W + R > N, every read overlaps with the latest write and we get valid data (see the sketch after this list).
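To make the quorum condition concrete, here is a minimal sketch; the function name and example values are illustrative, not from any particular database:

```python
# Quorum check for a leaderless cluster: a read is guaranteed to
# overlap with the most recent successful write when W + R > N.
def has_quorum_overlap(n: int, w: int, r: int) -> bool:
    """n: total replicas, w: write quorum size, r: read quorum size."""
    return w + r > n

# N = 5 replicas: W = 3 and R = 3 must share at least one replica,
# so every read sees the latest write.
assert has_quorum_overlap(5, 3, 3)

# W = 2, R = 2 does not overlap (2 + 2 <= 5): a read may miss the write.
assert not has_quorum_overlap(5, 2, 2)
```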
- High Throughput: This means the service needs to handle a large number of requests in a given period of time. Commonly used metrics are QPS (queries per second) and TPS (transactions per second).
To achieve high throughput, we often add caches to the architecture so that requests can return without hitting slower I/O devices such as databases or disks. We can also increase the number of threads for computation-intensive tasks, although adding too many threads can degrade performance; instead, we need to identify the bottlenecks in the system and increase their throughput. Using asynchronous processing can often effectively isolate heavy-lifting components.
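As one common way to avoid hitting the database on every request, here is a minimal cache-aside sketch; `fetch_user_from_db` and the TTL value are hypothetical placeholders:

```python
import time
from typing import Any, Callable

class TTLCache:
    """A toy in-memory cache with per-entry expiry (cache-aside pattern)."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                       # cache hit: no slow I/O
        value = loader()                        # cache miss: hit the slow store
        self.store[key] = (now + self.ttl, value)
        return value

def fetch_user_from_db(user_id: str) -> dict:   # hypothetical slow DB call
    time.sleep(0.05)
    return {"id": user_id, "name": "example"}

cache = TTLCache(ttl_seconds=30)
user = cache.get_or_load("user:42", lambda: fetch_user_from_db("42"))
```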
- High Scalability: This means a system can quickly and easily grow to accommodate more volume, either by adding nodes (horizontal scalability) or by running on more powerful hardware (vertical scalability). Normally we watch the response time to decide whether we need to scale the system.
The CAP theorem states that any distributed data store can provide only two of the following three guarantees:
- Consistency - Every read receives the most recent write or an error.
- Availability - Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
- Partition tolerance - The system continues to operate despite network faults that drop or delay messages between nodes.
Since network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.
Scalability
Scalability is the ability of a system to handle an increasing workload, either by adding more resources (scaling out) or by upgrading the capacity of existing resources (scaling up). In distributed systems, scalability is essential to ensure that the system can effectively manage the growing demands of users, data, and processing power. Here's an overview of the different aspects of scalability:
Horizontal Scaling
Horizontal scaling, also known as scaling out, involves adding more machines or nodes to a system to distribute the workload evenly. This approach allows the system to handle an increased number of requests without overloading individual nodes. Horizontal scaling is particularly useful in distributed systems because it provides a cost-effective way to manage fluctuating workloads and maintain high availability.
Vertical Scaling
Vertical scaling, or scaling up, refers to increasing the capacity of individual nodes within a system. This can be achieved by upgrading the hardware, such as adding more CPU, memory, or storage. Vertical scaling can help improve the performance of a system by allowing it to handle more workloads on a single node. However, this approach has limitations, as there is a physical limit to the amount of resources that can be added to a single machine, and it can also lead to single points of failure.
Horizontal vs. Vertical Scaling
With horizontal scaling, it is often easier to scale dynamically by adding more machines to the existing pool. Vertical scaling is usually limited to the capacity of a single server; scaling beyond that capacity often involves downtime and comes with an upper limit. Good examples of horizontal scaling are Cassandra and MongoDB, as both make it easy to scale out by adding more machines to meet growing needs. Similarly, a good example of vertical scaling is MySQL, as it allows you to scale up by switching from smaller to bigger machines; however, this process often involves downtime.
Availability
Availability is a measure of how accessible and reliable a system is to its users. In distributed systems, high availability is crucial to ensure that the system remains operational even in the face of failures or increased demand. It is the backbone that enables businesses to provide uninterrupted services to their users, regardless of any unforeseen circumstances. In today’s fast-paced digital world, where downtime can lead to significant financial losses and reputational damage, high availability has become a critical requirement for organizations across various industries.
To ensure high availability and fault tolerance, load balancers should be designed and deployed with redundancy in mind. This means having multiple instances of load balancers that can take over if one fails. Redundancy can be achieved through several failover strategies:
Active-passive configuration:
In this setup, one load balancer (the active instance) handles all incoming traffic while the other (the passive instance) remains on standby. If the active load balancer fails, the passive instance takes over and starts processing requests. This configuration provides a simple and reliable failover mechanism but does not utilize the resources of the passive instance during normal operation.
Active-active configuration:
In this setup, multiple load balancer instances actively process incoming traffic simultaneously. Traffic is distributed among the instances using methods such as DNS load balancing or an additional load balancer layer. If one instance fails, the others continue to process traffic with minimal disruption. This configuration provides better resource utilization and increased fault tolerance compared to the active-passive setup.
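As an illustration of failover from the client's point of view, here is a minimal sketch that spreads requests across several active load balancer instances and skips one that stops responding; the endpoint addresses and timeout are made-up values:

```python
import itertools
import urllib.request

# Hypothetical addresses of two active load balancer instances.
LB_ENDPOINTS = ["http://lb1.example.internal", "http://lb2.example.internal"]
_rotation = itertools.cycle(LB_ENDPOINTS)   # round-robin distribution

def send_request(path: str, timeout: float = 2.0) -> bytes:
    """Send a request via any live load balancer, failing over on errors."""
    last_error = None
    for _ in range(len(LB_ENDPOINTS)):       # try each instance at most once
        endpoint = next(_rotation)
        try:
            with urllib.request.urlopen(endpoint + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as error:             # connection refused, timeout, ...
            last_error = error               # move on to the next instance
    raise ConnectionError("all load balancer instances are down") from last_error
```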
Effective health checks and monitoring are essential components of high availability and fault tolerance for load balancers. Health checks are periodic tests performed by the load balancer to determine the availability and performance of backend servers. By monitoring the health of backend servers, load balancers can automatically remove unhealthy servers from the server pool and avoid sending traffic to them, ensuring a better user experience and preventing cascading failures.
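A minimal sketch of such a health check loop; the backend pool, probe path, and interval are illustrative assumptions:

```python
import threading
import urllib.request

BACKENDS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # assumed pool
healthy = set(BACKENDS)     # servers currently eligible to receive traffic

def check_backends(interval_seconds: float = 5.0) -> None:
    """Probe each backend and add or remove it from the healthy pool."""
    for server in BACKENDS:
        try:
            # An assumed lightweight endpoint the backends expose for probes.
            with urllib.request.urlopen(server + "/healthz", timeout=1.0) as resp:
                ok = resp.status == 200
        except OSError:
            ok = False
        if ok:
            healthy.add(server)       # recovered servers rejoin the pool
        else:
            healthy.discard(server)   # stop routing traffic to this server
    # Schedule the next round of checks.
    threading.Timer(interval_seconds, check_backends, [interval_seconds]).start()

check_backends()
```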
Monitoring the load balancer itself is also crucial. By keeping track of performance metrics, such as response times, error rates, and resource utilization, we can detect potential issues and take corrective action before they lead to failures or service degradation.
In addition to regular health checks and monitoring, it is essential to have proper alerting and incident response procedures in place. This ensures that the appropriate personnel are notified of any issues and can take action to resolve them quickly.
In active-active and active-passive configurations, it is crucial to ensure that the load balancer instances maintain a consistent view of the system's state, including the status of backend servers, session data, and other configuration settings. This can be achieved through various mechanisms, such as:
Centralized configuration management:
Using a centralized configuration store (e.g., etcd, Consul, or ZooKeeper) to maintain and distribute configuration data among load balancer instances ensures that all instances are using the same settings and are aware of changes.
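For example, a load balancer instance could poll a key in Consul's KV store for its current settings. This sketch uses Consul's HTTP KV API; the key name and address are assumptions:

```python
import base64
import json
import urllib.request

CONSUL_KV_URL = "http://localhost:8500/v1/kv/loadbalancer/config"  # assumed key

def fetch_config() -> dict:
    """Read shared load balancer settings from Consul's KV store."""
    with urllib.request.urlopen(CONSUL_KV_URL) as resp:
        entries = json.load(resp)
    # Consul returns a list of entries with base64-encoded values.
    raw = base64.b64decode(entries[0]["Value"])
    return json.loads(raw)

config = fetch_config()   # every instance reads the same settings
```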
State sharing and replication:
In scenarios where load balancers must maintain session data or other state information, it is crucial to ensure that this data is synchronized and replicated across instances. This can be achieved through database replication, distributed caching systems (e.g., Redis or Memcached), or built-in state-sharing mechanisms provided by the load balancer software or hardware.
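For instance, session data can live in a shared Redis instance so that every load balancer or application node sees the same state. This sketch uses the `redis-py` client; the host name and TTL are assumptions:

```python
import json
import redis

# A shared Redis instance reachable by every load balancer / app node.
r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

def save_session(session_id: str, data: dict, ttl_seconds: int = 1800) -> None:
    # SETEX stores the value with an expiry, so stale sessions clean themselves up.
    r.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```

More broadly, a number of techniques help keep a distributed system available and fault tolerant: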
- Load balancing: Distribute network traffic across multiple servers, containers, or cloud instances. This helps to optimize resource utilization and avoid performance degradation.
- Replication: Use multiple identical versions of systems and subsystems to ensure that they provide the same results. If the results differ, a procedure can identify the faulty system.
- Backup components: Use backup components to automatically replace failed components and prevent service loss.
- Circuit breaker pattern: Wrap calls to external dependencies in a circuit breaker to monitor their health and prevent cascading failures (a sketch follows after this list).
- Fault isolation: Isolate faulty components or nodes to prevent failures from spreading to other parts of the system.
- Data partitioning: Distribute data across multiple machines.
- Data replication: Store each partition on multiple nodes for redundancy.
- Failure detection and recovery: Implement strategies for detecting and recovering from failures.
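To make the circuit breaker pattern concrete, here is a minimal sketch of a breaker that opens after repeated failures and allows a trial call after a cooldown; the thresholds are illustrative, not a prescribed configuration:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_seconds`."""

    def __init__(self, max_failures: int = 3, reset_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                # Fail fast instead of piling load on a struggling dependency.
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit
        return result
```

Putting these techniques into practice typically involves: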
- Identifying and analyzing potential failure points
- Monitoring system performance in real time
- Implementing automated tests
- Testing the system in a simulated environment
- Using distributed systems
- Taking advantage of fault detection tools