How to achieve availability in distributed systems?

The short answer has already come up: add redundancy to your system.

Build your system in such a way that when things go wrong, redundant resources can handle the load and continue serving your users.
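As a rough illustration of redundant resources taking over, here is a minimal failover sketch. The names `call_with_failover`, `healthy_replica`, and `crashed_replica` are hypothetical and are not part of any real library. The replicas are plain functions standing in for servers, and a `ConnectionError` is assumed to signal a failed node.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica failed to serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is down; fall through to the next redundant one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


# Hypothetical stand-ins for real servers.
def healthy_replica(request: str) -> str:
    return f"ok: {request}"


def crashed_replica(request: str) -> str:
    raise ConnectionError("replica unreachable")


# The first replica is down, but the second one still serves the user.
print(call_with_failover([crashed_replica, healthy_replica], "GET /home"))
```

Real systems usually hide this logic behind a load balancer or a client library with health checks and retries, but the idea is the same: a failure of one copy is absorbed by another.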

In this context, let’s introduce the concept of SPoF.

A Single Point of Failure (SPoF) in a distributed system is a component whose failure can bring the entire system down.

For example, your home router can be a SPoF. If the router is down, you lose access to the internet.


To build an available system, you need to avoid any possible SPoF so that the system can continue to function even when there are failures.
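A simple first pass at spotting SPoFs is to list every component and check whether it has a redundant copy. The sketch below assumes a hypothetical inventory that maps component names to replica counts. The function `find_spofs` and the deployment shown are made up for illustration.

```python
def find_spofs(replica_counts: dict[str, int]) -> list[str]:
    """Return components that have no redundant copy (fewer than 2 replicas)."""
    return sorted(name for name, count in replica_counts.items() if count < 2)


# Hypothetical deployment: component name -> number of running replicas.
deployment = {"load_balancer": 1, "app_server": 5, "database": 1}
print(find_spofs(deployment))  # ['database', 'load_balancer']
```

Replica count alone is a simplification. Several replicas that share one power supply, rack, or network link can still fail together, so the shared dependency is itself a SPoF.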

Setting up expectations

So far, we have seen how critical it is for a system to be resilient to faults and to keep operating correctly in adverse scenarios. This is why distributed systems pursue properties like availability and reliability.

But the question is, how far should you go as a distributed system owner?

Let’s think of availability.

Assume My Cool App has 5 nodes running the server program. Thanks to a well-designed architecture, the system can tolerate losing 1 or 2 nodes. What happens if 3 nodes crash? What if all the nodes crash at the same time?

Somewhere down the line, you need to set expectations.
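One way to put a number on such an expectation is to estimate how often more nodes are down than the system can tolerate. The sketch below uses the 5-node example and makes two assumptions that are not from the lesson: each node is up 99% of the time, and nodes fail independently of each other. The function name `probability_system_up` is hypothetical.

```python
from math import comb


def probability_system_up(nodes: int, tolerated_failures: int, node_up: float) -> float:
    """Probability that at most `tolerated_failures` of `nodes` are down at once.

    Assumes nodes fail independently, each being up with probability `node_up`.
    """
    node_down = 1.0 - node_up
    return sum(
        comb(nodes, k) * node_down**k * node_up ** (nodes - k)
        for k in range(tolerated_failures + 1)
    )


# Hypothetical figures: 5 nodes, the system survives up to 2 failures,
# and each node is up 99% of the time.
print(f"{probability_system_up(5, 2, 0.99):.6f}")
```

Under these assumptions the result is very close to 1, but never exactly 1. The independence assumption is also optimistic: a shared cause, such as a bad deployment or a data center outage, can take all nodes down at the same time, which is exactly the case no amount of per-node redundancy covers.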

If in your system, all nodes crash, no algorithm can help to ...
