Enable Fault Tolerance and Failure Detection
Learn how to make a key-value store fault tolerant and able to detect failure.
Handle temporary failures
Typically, distributed systems use a quorum-based approach to handle failures. A quorum is the minimum number of votes required for a distributed transaction to proceed with an operation. If a server is part of the consensus and is down, then we can’t perform the required operation. It affects the availability and durability of our system.
We’ll use a sloppy quorum instead of strict quorum membership. Usually, a leader manages the communication among the participants of the consensus. The participants send an acknowledgment after committing a successful write. Upon receiving these acknowledgments, the leader responds to the client. However, the drawback is that the participants are easily affected by the network outage. If the leader is temporarily down and the participants can’t reach it, they declare the leader dead. Now, a new leader has to be reelected. Such frequent elections have a negative impact on performance because the system spends more time picking a leader than accomplishing any actual work.
In the sloppy quorum, the first healthy nodes from the preference list handle all read and write operations. The healthy nodes may not always be the first nodes discovered when moving clockwise in the consistent hash ring.
Let’s consider the following configuration with . If node is briefly unavailable or unreachable during a write operation, the request is sent to the next healthy node from the preference list, which is node in this case. It ensures the desired availability and durability. After processing the request, the node includes a hint as to which node was the intended receiver (in this case, ). Once node is up and running again, node sends the request information to so it can update its data. Upon completion of the transfer, removes this item from its local storage without affecting the total number of replicas in the system.