Handling Hardware and Software Faults

Learn how to handle hardware and software faults in distributed systems.

Now that we’ve discussed two common forms of faults, let’s look at general techniques for handling them in a distributed system.

Handling hardware faults

For disk failures, a very common practice is to add redundancy by storing the same data on more than one disk drive. These disks are generally inexpensive. If one disk suffers an unrecoverable failure, the data can be recovered from another disk that holds the same copy.
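The idea of keeping the same data on more than one disk can be illustrated with a minimal Python sketch. It is only a toy model, not how real disk redundancy is implemented: ordinary directories stand in for disks, and the `MirroredStore` class and its `write` and `read` methods are hypothetical names introduced here for illustration.

```python
import os
import tempfile


class MirroredStore:
    """Toy mirror: each directory stands in for one disk (an assumption)."""

    def __init__(self, replica_dirs):
        self.replica_dirs = list(replica_dirs)

    def write(self, name, data):
        # Write the same bytes to every replica so each holds a full copy.
        for directory in self.replica_dirs:
            with open(os.path.join(directory, name), "wb") as f:
                f.write(data)

    def read(self, name):
        # Try each replica in turn; fall back to the next one on failure.
        last_error = None
        for directory in self.replica_dirs:
            try:
                with open(os.path.join(directory, name), "rb") as f:
                    return f.read()
            except OSError as error:
                last_error = error
        raise OSError(f"all replicas failed for {name!r}") from last_error


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as disk_a, \
            tempfile.TemporaryDirectory() as disk_b:
        store = MirroredStore([disk_a, disk_b])
        store.write("orders.db", b"order data")
        # Simulate an unrecoverable failure of the first "disk".
        os.remove(os.path.join(disk_a, "orders.db"))
        # The read still succeeds because the second copy survives.
        print(store.read("orders.db"))
```

In practice this duplication is usually handled below the application, by a hardware controller or the operating system, so the application sees a single logical disk.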

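Note: Arranging multiple disks so that data survives a disk failure is commonly done with RAID (Redundant Array of Independent Disks). Storing an identical copy of the data on each disk is the mirroring form of RAID, known as RAID 1.

Say the initial system of My Cool App looked like this: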

  • A single node.
  • The server and database process are both running on the same node.

Now, if the disk of the single node fails, the node is unusable. In this ...