What is fault tolerance?

Fault tolerance refers to the ability of a system (computer, network, cloud cluster, etc.) to continue operating without interruption when one or more of its components fail.

Fault-tolerant systems aim to ensure high availability by eliminating single points of failure, so that the failure of one component does not disrupt the system as a whole.

There are two fault-tolerant approaches:

  • fault-removal – the error is detected and the system state is repaired, using either forward error recovery or backward error recovery.
  • fault-masking – redundant components hide the effect of a fault, so that it never becomes visible at the system's output.
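
One common way to mask a fault with redundancy is to run several replicas of a component and take a majority vote over their outputs (often called triple modular redundancy when three replicas are used). The following is a minimal sketch of that idea, not a production design; the names majority_vote, run_replicated, healthy, and faulty are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value was reported by more than half of the replicas.
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


# Hypothetical replicas: two healthy modules and one faulty module.
def healthy(x):
    return x * x


def faulty(x):
    return -1  # simulated fault: always returns a wrong value


# The faulty replica is outvoted, so the caller never sees its output.
result = run_replicated([healthy, faulty, healthy], 7)
```

With three replicas, this scheme hides one faulty replica. If two replicas fail in different ways, no majority exists and the sketch raises an error instead of returning a wrong answer.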

For fault tolerance with zero downtime, a “hot” failover, which instantly transfers workloads to a constantly active backup system, needs to be implemented. If a constantly active standby system is not required, a “warm” or “cold” failover can be used instead, in which the backup system must first load and start before it can run workloads. Warm and cold failovers are slower because of this loading time.
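
The difference between a hot and a cold standby can be sketched in a few lines. This is an illustrative model only: the Node and FailoverPair classes, the started flag, and the use of ConnectionError to signal a dead primary are all assumptions made for the example, not part of any real failover product.

```python
class Node:
    """Hypothetical server; `started` models whether it is already running."""

    def __init__(self, name, started=False):
        self.name = name
        self.started = started
        self.healthy = True

    def start(self):
        # A cold standby pays this start-up cost at failover time.
        self.started = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        if not self.started:
            raise RuntimeError(f"{self.name} has not been started")
        return f"{self.name} handled {request}"


class FailoverPair:
    """Routes requests to the active node and fails over to the standby."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # A hot standby is already started; a cold one must start first.
            if not self.standby.started:
                self.standby.start()
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


primary = Node("primary", started=True)
hot = Node("hot-standby", started=True)
pair = FailoverPair(primary, hot)

first = pair.handle("req-1")   # served by the primary
primary.healthy = False        # simulate a failure of the primary
second = pair.handle("req-2")  # served by the standby after failover
```

In this sketch the only difference between hot and cold failover is whether start() has to run during the failover. In a real system that step is where the loading time is spent.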

Fault-tolerant computing that relies on redundant hardware offers little protection against software failure, which is a major cause of downtime and data center outages for many organizations.

Forward and backward error recovery

Forward error recovery involves identifying the error and, based on this knowledge, correcting the erroneous system state so that execution can continue. Exception handling in high-level languages such as Ada and PL/I provides a structure that supports forward recovery.

Backward error recovery corrects the system state by restoring the system to a stable state that existed prior to the manifestation of the fault.
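
The two recovery styles can be contrasted with a small sketch. It assumes a hypothetical Account class and a withdraw operation that leaves the state invalid before the error is detected; the choice to repair a negative balance by setting it to zero is purely illustrative.

```python
import copy


class Account:
    """Hypothetical piece of system state used for the example."""

    def __init__(self, balance):
        self.balance = balance


def withdraw(account, amount):
    # The state is modified first, so a failure leaves it erroneous.
    account.balance -= amount
    if account.balance < 0:
        raise ValueError("balance went negative")


def backward_recovery(account, amount):
    # Save a checkpoint of the known-good state before the operation.
    checkpoint = copy.deepcopy(account.__dict__)
    try:
        withdraw(account, amount)
    except ValueError:
        # Roll back to the state that existed before the fault appeared.
        account.__dict__.update(checkpoint)


def forward_recovery(account, amount):
    try:
        withdraw(account, amount)
    except ValueError:
        # The error is known (negative balance), so repair the state
        # directly and move on instead of rolling back.
        account.balance = 0


a = Account(100)
backward_recovery(a, 150)  # a.balance is restored from the checkpoint

b = Account(100)
forward_recovery(b, 150)   # b.balance is corrected to a new valid value
```

Backward recovery needs no knowledge of what went wrong but requires a saved checkpoint, while forward recovery needs no checkpoint but depends on understanding the error well enough to repair it.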

Advantages of fault-tolerant systems

The key purpose of fault tolerance is to avoid, or at least minimize, situations where the system's functionality becomes unavailable because of a fault in one or more of its components.

Fault tolerance is necessary for systems that protect people’s safety (such as air traffic control hardware and software) and for systems on which security, data protection, data integrity, and high-value transactions depend.

Fault-tolerant systems provide an excellent safeguard against equipment failure, but they can be extraordinarily expensive to implement because they require a fully redundant set of hardware that needs to be linked to the primary system.

Copyright ©2024 Educative, Inc. All rights reserved