Reliability
Learn about reliability in distributed systems.
In this lesson, we will discuss the concept of reliability, the first pillar of fault-tolerant systems.
What is reliability?
In Designing Data-Intensive Applications, Martin Kleppmann puts it simply: a reliable system is capable of “continuing to work correctly, even when things go wrong.”
Let’s explain more.
When you build a distributed system, it is of little use unless it meets the following expectations:
- The system does what users expect. For example, if My Cool App is a photo-sharing app, users should be able to share photos. If someone uploads a photo and it does not appear on their profile, that is a bad user experience.
- The system tolerates user mistakes. If a user of My Cool App uploads a video where only photos are expected, the system should not break but handle it correctly. For instance, maybe the