Chain of Failure

Learn about chain bugs, the connection between fault, error, and failure, the causes of crack propagation, and different beliefs about fault-tolerant systems.

Independent events

Underneath every system outage is a chain of events like this. One small issue leads to another, which leads to another. Looking at the entire chain of failure after the fact, the failure seems inevitable. Yet if you tried to estimate the probability of that exact chain of events occurring, it would look incredibly improbable. It looks improbable only if you treat each event as independent of the others, the way coin tosses are: a coin has no memory, so each toss has the same probability regardless of previous tosses.

Failures in a system do not behave like coin tosses. A failure at one point or in one layer increases the probability of other failures. If the database becomes slow, for example, requests wait longer and hold their resources longer, so the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.
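To make the difference concrete, here is a minimal sketch in Python that compares the two ways of estimating the probability of a three-event chain. All of the numbers are invented for illustration, and the names (marginal, conditional, chain_probability) are hypothetical; nothing here is measured from a real system. The independent estimate multiplies the standalone probability of each event, while the coupled estimate multiplies the probability of each event given that the earlier events in the chain have already happened.

```python
from math import prod

# Hypothetical standalone probabilities of three events on a given day:
# slow database, app server out of memory, site-wide outage.
marginal = [0.01, 0.01, 0.01]

# Hypothetical conditional probabilities for the same chain:
# P(slow database),
# P(out of memory | slow database),
# P(outage | slow database and out of memory).
conditional = [0.01, 0.5, 0.8]


def chain_probability(step_probabilities):
    """Multiply the per-step probabilities to get the probability of the whole chain."""
    return prod(step_probabilities)


# Treating the events like coin tosses (independent).
independent_estimate = chain_probability(marginal)

# Accounting for coupling between layers (chain rule of probability).
coupled_estimate = chain_probability(conditional)

# How many times more likely the chain is once coupling is considered.
ratio = coupled_estimate / independent_estimate

print(f"independent estimate: {independent_estimate:.6f}")
print(f"coupled estimate:     {coupled_estimate:.6f}")
print(f"coupled / independent: {ratio:.0f}x")
```

With these made-up inputs, the chain is thousands of times more likely than the independent estimate suggests. The exact figures are unimportant; the point is that once the first event has occurred, the later steps stop being rare.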

Chain of events

Here’s some common terminology we can use to be precise about these chains ...