So far we’ve just discussed removing faulty elements and haven’t yet explored adding repaired or new elements to the system. Let's see how we can successfully integrate a new or repaired component into a system of state machine replicas.
Integrating repaired elements
It is not enough for the element being added to be non-faulty. It must also be in the right state to behave consistently with other components. Let's start by introducing some notation:
We define e[ri] as the state of a non-faulty element e after it has processed requests r0 through ri. An element e that joins a configuration after request rjoin must be in state e[rjoin]; only then will it behave consistently with the other components and successfully become part of the system.
An element is self-stabilizing if its current state is completely determined by a fixed number of previously processed inputs, say k inputs. For such an element, all we need to do is let it run long enough to process k inputs, after which it will be in state e[rjoin]. For non-self-stabilizing elements, we need a different approach. The following discussion covers two such cases.
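To make the self-stabilizing case concrete, here is a minimal Python sketch. The `WindowSum` class is a hypothetical element invented for illustration: its state is the sum of its last k inputs, so it depends on nothing older than that. A newcomer that has processed only the most recent k requests ends up in the same state as an element that has been running since r0.

```python
from collections import deque


class WindowSum:
    """Hypothetical self-stabilizing element: state depends only on the last k inputs."""

    def __init__(self, k):
        self.window = deque(maxlen=k)  # older inputs fall out automatically

    def process(self, value):
        self.window.append(value)

    def state(self):
        return sum(self.window)


K = 3
requests = [5, 1, 4, 2, 8, 7]

# An element that has processed every request from the start.
veteran = WindowSum(K)
for r in requests:
    veteran.process(r)

# An element that joins late and sees only the last K requests.
newcomer = WindowSum(K)
for r in requests[-K:]:
    newcomer.process(r)

# An element that has not yet run long enough (only K - 1 inputs).
too_early = WindowSum(K)
for r in requests[-(K - 1):]:
    too_early.process(r)

print(veteran.state(), newcomer.state(), too_early.state())  # 17 17 15
```

The third element shows why the waiting period matters: until k inputs have been processed, the state can still differ from that of the other elements.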
Logical clocks and fail-stop failures
When using logical clocks and assuming only fail-stop failures, all we need is the state of some running state machine replica smi. Under fail-stop failures, a replica that is still running is non-faulty, so the state of smi is correct. Let's consider the following three cases in which the integrated element is an output device, a client, or a state machine replica:
For an output device e, little information is needed to integrate it. This typically consists of setup and startup details and other data that changes infrequently, all of which can be stored in the state machine replicas.
For a client e, we can obtain the required information from other clients.
For a state machine replica e, we can use information from any of the non-faulty state machine replicas smi.
For an output device or client e, we can communicate state e[rjoin] to e before the system processes any request whose unique identifier is larger than uid(rjoin). Once e’s integration is complete, we can resume processing requests normally.
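The pause-and-resume idea for an output device can be sketched as follows. This is an illustrative model, not a prescribed implementation: the `Replica` and `OutputDevice` classes, their method names, and the use of a running total as the state are all assumptions made for the example. Requests with a unique identifier larger than the join point are held back until the device has received e[rjoin], then released in ascending identifier order.

```python
class OutputDevice:
    """Hypothetical output device that mirrors a running total."""

    def __init__(self):
        self.state = None

    def install(self, state):
        self.state = state

    def observe(self, delta):
        self.state += delta


class Replica:
    """Hypothetical replica that holds back requests during an integration."""

    def __init__(self):
        self.state = 0
        self.held = []        # requests with uid > join_uid, waiting for the join
        self.join_uid = None  # None means no integration is in progress
        self.devices = []

    def begin_integration(self, join_uid):
        self.join_uid = join_uid

    def submit(self, uid, delta):
        if self.join_uid is not None and uid > self.join_uid:
            self.held.append((uid, delta))  # do not process past the join point yet
        else:
            self._apply(delta)

    def _apply(self, delta):
        self.state += delta
        for device in self.devices:
            device.observe(delta)

    def finish_integration(self, device):
        device.install(self.state)  # communicate e[rjoin] to the new element
        self.devices.append(device)
        self.join_uid = None
        held, self.held = sorted(self.held), []
        for _, delta in held:       # resume normal processing in uid order
            self._apply(delta)


replica = Replica()
replica.submit(1, 10)
replica.submit(2, 5)
replica.begin_integration(join_uid=2)
replica.submit(4, 7)  # held: uid is larger than the join point
replica.submit(3, 1)  # held
state_at_join = replica.state

device = OutputDevice()
replica.finish_integration(device)
print(state_at_join, replica.state, device.state)  # 15 23 23
```

The device starts from the state at the join point and then sees every later request, so it never observes a request that its installed state does not already account for.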
For a state machine replica e, we require a more complex solution. We will refer to the state machine being integrated as smnew. Having a state machine replica smi send the values of its state variables and copies of any pending requests to smnew is not enough to ensure correct integration. The problem arises when smi receives a request after it has sent e[rjoin] to smnew: smnew never learns of this request and therefore never processes it.
To solve this problem, smi must relay to smnew, in ascending order of their unique identifiers, the client requests it receives after sending e[rjoin]. Eventually, smnew will also start receiving requests directly from the clients. smnew will then inform smi of the identifiers it has received from each client. Using this information, smi can decide when to stop relaying client requests to smnew.
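The relay protocol can be sketched in Python as below. Everything here is a simplified assumption for illustration: the class and method names are hypothetical, the state is a running total (a commutative update, so the sketch does not model request ordering at smnew), and smnew reports the first identifier it received directly from each client. Once smi knows that identifier, it stops relaying that client's requests from that point on.

```python
import math


class NewReplica:
    """Stands in for smnew, the replica being integrated."""

    def __init__(self):
        self.state = None
        self.applied = set()    # uids already processed (filters duplicates)
        self.first_direct = {}  # client -> first uid received directly

    def install_snapshot(self, state, applied):
        self.state = state
        self.applied = set(applied)

    def receive(self, uid, client, delta, direct):
        if direct:
            self.first_direct.setdefault(client, uid)
        if uid not in self.applied:
            self.applied.add(uid)
            self.state += delta


class ExistingReplica:
    """Stands in for smi, the non-faulty replica that helps smnew catch up."""

    def __init__(self):
        self.state = 0
        self.applied = set()
        self.target = None     # replica being integrated, if any
        self.buffer = []       # requests waiting to be relayed
        self.direct_from = {}  # client -> first uid smnew received directly

    def start_integration(self, target):
        self.target = target
        target.install_snapshot(self.state, self.applied)  # send e[rjoin]

    def receive(self, uid, client, delta):
        self.applied.add(uid)
        self.state += delta
        # Relay only requests that smnew cannot have received directly.
        if self.target is not None and uid < self.direct_from.get(client, math.inf):
            self.buffer.append((uid, client, delta))

    def flush(self):
        for uid, client, delta in sorted(self.buffer):  # ascending uid order
            self.target.receive(uid, client, delta, direct=False)
        self.buffer.clear()

    def acknowledge(self, first_direct):
        self.direct_from.update(first_direct)

    def done_relaying(self, clients):
        return not self.buffer and all(c in self.direct_from for c in clients)


sm_i, sm_new = ExistingReplica(), NewReplica()
sm_i.receive(1, "A", 10)
sm_i.receive(2, "B", 5)
sm_i.start_integration(sm_new)

sm_i.receive(3, "A", 1)  # arrives after the snapshot: must be relayed
sm_i.receive(4, "B", 2)
sm_i.flush()

sm_new.receive(5, "A", 3, direct=True)  # client A now reaches smnew directly
sm_i.acknowledge(sm_new.first_direct)
sm_i.receive(5, "A", 3)                 # not relayed
sm_i.receive(6, "B", 4)                 # client B still reaches only smi
sm_i.flush()

sm_new.receive(7, "B", 1, direct=True)
sm_i.acknowledge(sm_new.first_direct)
sm_i.receive(7, "B", 1)

print(sm_i.state, sm_new.state, sm_i.done_relaying(["A", "B"]))  # 26 26 True
```

The duplicate filter in `NewReplica.receive` is a design choice of this sketch: it keeps smnew correct even if a request happens to arrive both directly and through the relay.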