...

/

State Machine Replication (SMR)

State Machine Replication (SMR)

Learn how to build replicated data stores that can tolerate faults.

Motivation

Providing fault-tolerant services to clients is a desirable property of a system. State machine replication (SMR) is a mechanism to implement fault-tolerant services. SMR models a system as a state machine and replicates multiple copies of these state machines, such that failures are independent (meaning one failure only impacts one state machine). These state machines start with the same initial state, and the subsequent clients’ requests reach every replica in the same order, which applies those commands to transition to the same new state of the state machine (we are assuming deterministic logic that transitions the state machine from one state to the next).

If nothing could fail in a system, life would have been easier. However, real systems fail in a myriad of ways. In this lesson, we will primarily focus on node failures.

Note: SMR is an involved topic. In this abridged version, we will only highlight salient features of SMR. See the extended version for more details.

Node failures

A component is considered faulty when its behavior is inconsistent with its specification. The entire spectrum of failures considered in distributed systems is covered by its two ends.

  • Byzantine failure: This occurs in a node when it exhibits arbitrary behavior with other nodes, with the possibility of appearing to be working fine. Such failures cannot always be detected.

  • Fail-stop failure: This occurs in a node when its response to failure is to change to a state that reliably indicates its failure to other nodes.

Point to ponder

1.

Why are we just considering Byzantine and fail-stop models? What about the other models?

Show Answer
Q1 / Q2

Tolerating tt failures

We will specify fault tolerance using tt fault tolerance. A system is considered to be tt fault tolerant if it successfully carries out its functions, provided that no more than tt nodes failA t fault-tolerant system might continue to operate correctly even if more than t nodes fail, but its correct operation cannot be guaranteed beyond failures of t nodes..

Components of a state machine

A state machine consists of two main components: state variables and commands.

State variables

State variables of a state machine specify its state. These variables are any instances of data structures that provide details about the configuration of a server or information stored by them. They are like normal variables because they contain values.

Commands

Commands of a state machine are programs, functions, or code that alter the state of the state machine. These commands can perform the following operations:

  1. Update values of state variables

  2. Give an output

A vital characteristic of these commands is that they are deterministic. Another important characteristic of these commands is that their execution is atomic with other commands.

A state machine has the following complementary components.

Clients and requests

State machines have clients. A state machine’s client can request to execute its commands. The client specifies the state machine, the command’s name, and the information needed by the command in a request.

Output

A request produces an output when the command specified in the request produces an output. The output of a request (or command) can be to storage or memory hardware, a controlling device that controls hardware, or other clients. We can also refer to this output as the output of the state machine that processes the request.

Press + to interact
This diagram enumerates all components required to model a system as a state machine
This diagram enumerates all components required to model a system as a state machine

In the illustration above, the state machine has two state variables, arr and L (the length of arr). It has three commands: arr(val), update(val, ind), and read(val).

The diagram shows that its clients send requests in the order insert(3), update(5, 2), and read(2). The “Output(s)” box shows outputs that the state machine returns after processing these requests.

Request processing by state machines

State machines process one request at a time. The order in which state machines process requests is consistent with potential causality. The two important assumptions that a state machine’s clients can make about the order in which it processes requests are as follows:

  1. [O1O1]: A state machine processes requests by one of its clients in the order the client issued the requests. For example, a client cc issues requests to a state machine smsm in the following order: request AA, then request BB, and then request CC. In this case, smsm will process AA ...