Advanced System Design Interview Prep: Crash Course/

...

State Machine Replication (SMR)

Learn how to build replicated data stores that can tolerate faults.

Motivation

Providing fault-tolerant services to clients is a desirable property of a system. State machine replication (SMR) is a mechanism to implement fault-tolerant services. SMR models a system as a state machine and replicates multiple copies of these state machines, such that failures are independent (meaning one failure only impacts one state machine). These state machines start with the same initial state, and the subsequent clients’ requests reach every replica in the same order, which applies those commands to transition to the same new state of the state machine (we are assuming deterministic logic that transitions the state machine from one state to the next).

If nothing could fail in a system, life would have been easier. However, real systems fail in a myriad of ways. In this lesson, we will primarily focus on node failures.

Note: SMR is an involved topic. In this abridged version, we will only highlight salient features of SMR. See the extended version for more details.

Node failures

A component is considered faulty when its behavior is inconsistent with its specification. The entire spectrum of failures considered in distributed systems is covered by its two ends.

Byzantine failure: This occurs in a node when it exhibits arbitrary behavior with other nodes, with the possibility of appearing to be working fine. Such failures cannot always be detected.
Fail-stop failure: This occurs in a node when its response to failure is to change to a state that reliably indicates its failure to other nodes.

Tolerating $t$ failures

We will specify fault tolerance using $t$ fault tolerance. A system is considered to be $t$ fault tolerant if it successfully carries out its functions, provided that no more than $t$ nodes failA t fault-tolerant system might continue to operate correctly even if more than t nodes fail, but its correct operation cannot be guaranteed beyond failures of t nodes..

Components of a state machine

A state machine consists of two main components: state variables and commands.

State variables

State variables of a state machine specify its state. These variables are any instances of data structures that provide details about the configuration of a server or information stored by them. They are like normal variables because they contain values.

Commands

Commands of a state machine are programs, functions, or code that alter the state of the state machine. These commands can perform the following operations:

Update values of state variables
Give an output

A vital characteristic of these commands is that they are deterministic. Another important characteristic of these commands is that their execution is atomic with other commands.

A state machine has the following complementary components.

Clients and requests

State machines have clients. A state machine’s client can request to execute its commands. The client specifies the state machine, the command’s name, and the information needed by the command in a request.

Output

A request produces an output when the command specified in the request produces an output. The output of a request (or command) can be to storage or memory hardware, a controlling device that controls hardware, or other clients. We can also refer to this output as the output of the state machine that processes the request.

Press + to interact

In the illustration above, the state machine has two state variables, arr and L (the length of arr). It has three commands: arr(val), update(val, ind), and read(val).

The diagram shows that its clients send requests in the order insert(3), update(5, 2), and read(2). The “Output(s)” box shows outputs that the state machine returns after processing these requests.

Request processing by state machines

State machines process one request at a time. The order in which state machines process requests is consistent with potential causality. The two important assumptions that a state machine’s clients can make about the order in which it processes requests are as follows:

[ $O1$ ]: A state machine processes requests by one of its clients in the order the client issued the requests. For example, a client $c$ issues requests to a state machine $sm$ in the following order: request $A$ , then request $B$ , and then request $C$ . In this case, $sm$ will process $A$ first, then $B$ , and then $C$ .
[ $O2$ ...

Introduction

Design Problems

Conclusion

State Machine Replication (SMR)

Motivation

Node failures

Tolerating $t$ failures

Components of a state machine

State variables

Commands

Clients and requests

Output

Request processing by state machines

State Machine Replication (SMR)

Motivation

Node failures

Tolerating ttt failures

Components of a state machine

State variables

Commands

Clients and requests

Output

Request processing by state machines

Tolerating $t$ failures