Achieving Availability
Build an understanding of how availability is achieved in distributed systems.
How do we achieve availability in distributed systems?
We have already touched on the answer: add redundancy to your system.
Build your system in such a way that when things go wrong, redundant resources can handle the load and continue serving your users.
In this context, let’s introduce the concept of SPoF.
A single point of failure (SPoF) in a distributed system is a component whose failure alone can bring the entire system down.
For example, your home router can be a SPoF. If the router is down, you lose access to the internet.
To build an available system, you need to avoid any possible SPoF so that the system can continue to function even when there are failures.
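As a rough illustration of how redundancy removes a SPoF, here is a minimal Python sketch of client-side failover. Every name in it (call_with_failover, healthy_node, crashed_node, AllReplicasFailed) is hypothetical and invented for this example, and the nodes are plain functions standing in for real servers.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica fails to serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            failures += 1  # this replica is down; fall through to the next one
    raise AllReplicasFailed(f"{failures} replica(s) failed")


# Hypothetical stand-ins for real server nodes.
def healthy_node(request: str) -> str:
    return f"ok: {request}"


def crashed_node(request: str) -> str:
    raise ConnectionError("node is down")


# The first node has crashed, but the redundant node still serves the request.
response = call_with_failover([crashed_node, healthy_node], "GET /home")
print(response)
```

With only crashed_node in the list, that node is a SPoF and the call raises AllReplicasFailed. Note that the component doing the failover (here, the caller) must not become a SPoF itself.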
Setting up expectations
So far, we have seen how critical it is for a system to be resilient to faults and to keep operating correctly in adverse scenarios. This is why distributed systems pursue properties like availability and reliability.
But the question is, how far should you go as a distributed system owner?
Let’s think of availability.
Assume My Cool App has 5 nodes running the server program. Thanks to a well-designed architecture, the system can tolerate the loss of 1 or 2 nodes. What happens if 3 nodes crash? What if all the nodes crash at the same time?
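To put rough numbers on these questions, the sketch below estimates how often a 5-node system that tolerates 2 failures stays up. It rests on two loud assumptions that are not part of the scenario above: each node is up 99% of the time (a made-up figure), and nodes fail independently of one another. The function name system_availability is hypothetical as well.

```python
from math import comb


def system_availability(total_nodes: int, tolerated_failures: int, node_availability: float) -> float:
    """Probability that at most `tolerated_failures` of `total_nodes` are down at once.

    Assumes nodes fail independently, each being up with probability
    `node_availability` (a simple binomial model).
    """
    p_down = 1.0 - node_availability
    return sum(
        comb(total_nodes, k) * p_down**k * node_availability ** (total_nodes - k)
        for k in range(tolerated_failures + 1)
    )


# Assumed figure for illustration only: each node is up 99% of the time.
availability = system_availability(total_nodes=5, tolerated_failures=2, node_availability=0.99)

# Chance that all five nodes are down at the same moment under the same model.
all_down = (1 - 0.99) ** 5

print(f"system availability: {availability:.6f}")
print(f"all nodes down:      {all_down:.0e}")
```

Under these assumptions the system is available roughly 99.999% of the time, and a simultaneous crash of all 5 nodes is extremely unlikely but not impossible. Real failures are often correlated (a shared power supply, network, or bad deployment), so the independence assumption makes these figures optimistic.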
Somewhere down the line, you need to set expectations.
If all the nodes in your system crash, no algorithm can help to ...