What is Transient Fault Handling?

Overview

Cloud computing has changed how software is built and run in ways that were hard to imagine a decade ago. However, every advancement brings a new set of problems, and one of the most common problems in cloud applications is the transient failure.

Transient failures are temporary faults that usually clear on their own after a short time. They include the momentary loss of network access to components and services, the brief unavailability of a service, and timeouts that occur when a service is busy.

Because these faults are so widespread, experts have identified a pattern for dealing with them.

[Figure: Cloud computing]

Transient Fault Handling Pattern

The Transient Fault Handling Pattern, also called the Retry Pattern, provides us with a tried and tested method to handle a transient fault: try until it works. This might seem odd, since it simply tells us to retry the operation and hope that the fault has cleared, but because transient faults are short-lived by nature, the method works in practice.

[Figure: Retrying the failing request]

Explanation

In this example, the interaction between the user and the cloud service fails on the first and second attempts of the operation but succeeds on the third.
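The three-attempt scenario above can be sketched in a few lines of Python. This is only an illustration, not a production implementation: `TransientError`, `make_flaky_service`, and `retry` are hypothetical names invented for the example, and the "cloud service" is a stand-in that fails a fixed number of times before succeeding.

```python
class TransientError(Exception):
    """Hypothetical marker for a fault that is expected to clear on its own."""


def make_flaky_service(failures_before_success):
    """Return a stand-in for a cloud call that fails a fixed number of times."""
    state = {"calls": 0}

    def call():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError(f"attempt {state['calls']} timed out")
        return "OK"

    return call


def retry(operation, max_attempts=3):
    """Run operation, retrying on TransientError up to max_attempts times.

    Returns (result, attempts_used). Re-raises the last error if every
    attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise


# The service fails twice, so the third attempt succeeds.
result, attempts = retry(make_flaky_service(failures_before_success=2))
print(result, attempts)  # prints: OK 3
```

Capping the number of attempts is a deliberate choice in this sketch: "try until it works" still needs an upper bound so that a fault that never clears is eventually reported to the caller.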

Where can we apply the Transient Fault Handling Pattern?

We’ve seen how effective such a simple solution can be. However, it isn’t applicable everywhere, and we have to consider a couple of factors before retrying.

Knowing which failures to retry

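What if a relational database rejects a connection because of incorrect credentials? In this case, retrying won’t do us any good, because the same request will fail in the same way every time. It’s therefore important to identify the cause of a failure, where possible, and retry only when the fault is likely to be transient.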
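One way to express this distinction in code is to retry only on an explicit list of error types and let everything else fail immediately. The sketch below assumes two hypothetical exception classes, `AuthenticationError` and `ServiceBusyError`, alongside Python's built-in `TimeoutError` and `ConnectionError`; a real database driver defines its own error types, so the list would need to be adapted.

```python
class AuthenticationError(Exception):
    """Hypothetical permanent failure, such as incorrect credentials."""


class ServiceBusyError(Exception):
    """Hypothetical transient failure, such as a busy or throttling service."""


# Only these error types are treated as worth retrying (an assumption).
TRANSIENT_ERRORS = (ServiceBusyError, TimeoutError, ConnectionError)


def call_with_retry(operation, max_attempts=3):
    """Retry operation only when it raises one of TRANSIENT_ERRORS."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise
        # Any other exception is not caught here and propagates at once.


def attempts_made(error):
    """Count how often an operation that always raises error gets invoked."""
    calls = 0

    def operation():
        nonlocal calls
        calls += 1
        raise error

    try:
        call_with_retry(operation)
    except Exception:
        pass
    return calls


print(attempts_made(AuthenticationError("wrong password")))  # prints: 1
print(attempts_made(ServiceBusyError("try again later")))    # prints: 3
```

The incorrect-credentials case is attempted once and then reported, while the busy-service case uses all three attempts.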

The period between retrying

An overly aggressive retry strategy, with retries fired in quick succession, can cause the service to throttle or even block the client. It can also add so much load to an already busy service that the service never gets the chance to recover.
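A common way to space out retries is exponential backoff with random jitter: the wait grows after each failure, and the randomness keeps many clients from retrying at the same instant. The source does not prescribe a specific strategy, so the sketch below is one option under assumed values: the function names, the 0.5-second base delay, and the 5-second cap are all illustrative.

```python
import random
import time


def backoff_delay(retry_number, base=0.5, cap=5.0):
    """Exponential delay in seconds before the given retry (1-based), capped."""
    return min(cap, base * 2 ** (retry_number - 1))


def jittered_delay(retry_number, base=0.5, cap=5.0):
    """Pick a random delay between 0 and the exponential delay."""
    return random.uniform(0, backoff_delay(retry_number, base, cap))


def retry_with_backoff(operation, max_attempts=4, sleep=time.sleep):
    """Retry operation on TimeoutError, pausing longer after each failure.

    The sleep function is a parameter so that callers (and tests) can
    replace the real pause with something else.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise
            sleep(jittered_delay(attempt))


# Upper bounds of the wait before retries 1 through 5.
print([backoff_delay(n) for n in range(1, 6)])  # prints: [0.5, 1.0, 2.0, 4.0, 5.0]
```

The cap keeps the wait from growing without limit, and the attempt limit ensures that a service that stays down is eventually reported as failed instead of being retried forever.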

Copyright ©2024 Educative, Inc. All rights reserved