Fault Tolerance for Outputs and Clients

Learn how to generate outputs with fault tolerance from state machine replicas and protect them from faulty clients.

We'll cover the following...

Fault-tolerant outputs
- Outputting externally
- Outputting inside the system
Fault-tolerant clients
- Replicating the client
- Defensive programming
What's next?

We have already discussed how to make a group of state machines tolerant to faults. However, the output of the state machines goes to output devices that are read by voter devices, and these output and voter devices can also fail. In this lesson, we will discuss how to deal with such failures.

Fault-tolerant outputs

If we use a single output device for an ensemble of replicas, the resulting system is not $t$ fault-tolerant, because the failure of that one device can leave the system unable to generate correct outputs. Let's see how we can provide fault tolerance in this scenario.

Outputting externally

Many applications of state machine replication must send output to a client, system, or node that is not part of the group of replicas. Suppose a system of replicated state machines has a single output node that collects the outputs from all replicas and sends the combined output to its destination. A failure of that output node can then cause the system to generate incorrect outputs. Therefore, we need a solution that enables the system to tolerate faulty output devices.

To avoid this problem, we could replicate the output node. Each output node combines the outputs of all state machine replicas and sends its result to a stream or channel shared by all output nodes.

If output nodes can exhibit Byzantine failures, then taking the output generated by a majority of $2t+1$ replicated output nodes provides $t$ fault tolerance. If output nodes can only exhibit fail-stop failures, then $t+1$ replicated output nodes suffice, because any output that one of them produces is correct.
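The two rules above can be illustrated with a minimal Python sketch of what the destination might do with the values arriving on the shared channel. The function names `vote_byzantine` and `first_fail_stop`, and the use of `None` to mark a halted node, are assumptions made for this illustration and are not part of the lesson.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote_byzantine(outputs: Sequence[Hashable], t: int) -> Optional[Hashable]:
    """Pick the value reported by at least t + 1 of the 2t + 1 output nodes.

    With at most t faulty nodes, a value backed by t + 1 reports must
    include at least one correct node, so it is the correct output.
    Returns None if no value reaches that threshold.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= t + 1 else None


def first_fail_stop(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Pick the first output from a node that has not halted.

    Assumption for this sketch: a halted (fail-stop) node is represented
    by None. A fail-stop node never emits a wrong value, so any value
    that does arrive can be accepted.
    """
    for out in outputs:
        if out is not None:
            return out
    return None


if __name__ == "__main__":
    # t = 1: three output nodes, one of them Byzantine.
    print(vote_byzantine(["commit", "commit", "abort"], t=1))
    # t = 1: two output nodes, one of them halted.
    print(first_fail_stop([None, "commit"]))
```

In the Byzantine case the destination has to compare reports, which is why it needs $2t+1$ output nodes. In the fail-stop case a single surviving report is enough, which is why $t+1$ nodes are sufficient.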

Outputting inside the system

If a component inside the system, such as a client, has to receive the output, it should wait for ...
