Bulkhead Pattern

Learn the Bulkhead design pattern and its usage.

Intent

This pattern enforces resource partitioning and damage containment to preserve partial functionality in the case of a failure. The Bulkhead pattern is also known as the Failure Containment Principle and the Damage Control Principle.

Context and problem

The Titanic disaster has been well studied over the years, and it holds many lessons for the IT industry. A few of the many reasons why the ship sank are as follows:

  • Design flaws (to allow more living space in first class, the watertight compartments did not reach high enough).
  • Implementation/construction faults (the three million rivets used to hold different parts of the Titanic together were found to be made from substandard iron, and the collision with the iceberg badly damaged them).
  • Operational failures (the iceberg notice was given too late, and the ship was traveling too fast to react to any warning).

We can identify similar issues in software projects today, but fortunately they have not cost as many human lives so far, and as a consequence, we learn more slowly than other industries. That does not mean software failures have no consequences, however. There are examples (such as the Therac-25 machine) where software defects caused human deaths.

From an architectural point of view, the ship was designed to cope with four compartments being flooded, but the bulkheads that isolated the compartments did not reach the deck above and weren’t watertight. As a result, the water spread to more compartments and sank the ship. Similar cascading failures exist in software systems. A failure in one component can exhaust all available resources, and the failure can then spread to other components until the whole system is down. Avoiding such failure scenarios requires resource partitioning and capacity isolation at all levels, from data centers to individual thread pools.
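At the finest of those levels, capacity isolation can be as simple as capping how many concurrent calls each dependency may hold. The following is a minimal sketch under stated assumptions, not a definitive implementation: the Bulkhead class, the BulkheadFullError exception, and the "payments" and "search" dependencies are hypothetical names introduced only for illustration. It uses a semaphore per dependency and rejects calls once a compartment is full.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots left."""


class Bulkhead:
    """Caps the number of concurrent calls allowed into one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing: a caller that waits for a slot
        # becomes a held resource itself, which is how failures spread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


def demo():
    # Hypothetical dependencies, each with its own isolated capacity.
    payments = Bulkhead("payments", max_concurrent=2)
    search = Bulkhead("search", max_concurrent=2)

    release = threading.Event()
    entered = threading.Semaphore(0)

    def hung_payment():
        entered.release()   # signal that this call now occupies a slot
        release.wait()      # simulate a dependency that stops responding
        return "paid"

    workers = [
        threading.Thread(target=payments.call, args=(hung_payment,))
        for _ in range(2)
    ]
    for worker in workers:
        worker.start()
    for _ in workers:
        entered.acquire()   # wait until both payment slots are occupied

    # The flooded compartment rejects further calls...
    try:
        payments.call(lambda: "paid")
        payments_result = "accepted"
    except BulkheadFullError:
        payments_result = "rejected"

    # ...while the other compartment is unaffected.
    search_result = search.call(lambda: "results")

    release.set()
    for worker in workers:
        worker.join()
    return payments_result, search_result


if __name__ == "__main__":
    print(demo())
```

In this sketch, the hung payment calls use up only the payments compartment, so demo() returns ("rejected", "results"): new payment calls fail fast while search keeps working. Rejecting instead of blocking is a design choice here; a variant could wait for a slot with a timeout.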

Forces and solution

The Bulkhead pattern in software systems works on the same principle as in ships. When a system is partitioned into separate components with isolated resources, a failure in one partition is kept from cascading and bringing the whole system down. This pattern enforces the damage containment principle and improves the system’s resilience. The Bulkhead pattern can be implemented at many different levels of granularity, depending on the type of faults we want to protect the system from.

  • The most common way to apply the Bulkhead pattern is through physical redundancy. In this age of cloud computing and virtual machines, the only way to ensure that two hosts are on separate hardware (compute, storage, or networking) is by ensuring they are on two separate data centers (for ...