Reliability
Explore the concept of reliability in distributed systems and understand how such systems continue to function correctly even when faults occur. Learn about handling hardware and software failures, fault detection, and recovery mechanisms to build fault-tolerant and dependable distributed applications.
We'll cover the following...
In this lesson, we will discuss the concept of reliability, the first pillar of fault-tolerant systems.
What is reliability?
Martin Kleppmann puts it simply in Designing Data-Intensive Applications: a reliable system is capable of “continuing to work correctly, even when things go wrong.”
Let’s explain more.
When you build a distributed system, it is of little use unless it meets the following expectations:
- The system does what users expect. For example, if My Cool App is a photo-sharing app, users should be able to share photos. If someone uploads a photo and it does not appear on their profile, that is a bad user experience.
- The system tolerates user mistakes. If a user of My Cool App uploads a video where only photos are expected, the system should not break but should handle the upload gracefully. For instance, maybe the