Reliability

Learn about reliability in distributed systems.

In this lesson, we will discuss the concept of reliability, the first pillar of fault-tolerant systems.

What is reliability?

According to Martin Kleppmann, who puts it simply in Designing Data-Intensive Applications, a reliable system is capable of “continuing to work correctly, even when things go wrong.”

Let’s unpack this.

When you build a distributed system, it is of little use unless it meets the following expectations:

  • The system should meet users’ expectations. For example, if My Cool App is a photo-sharing app, then users should be able to share photos. If someone uploads a photo and it does not appear on their profile, the result is a bad user experience.
  • The system should tolerate users’ mistakes. If a user of My Cool App uploads a video when only photos are expected, the system should not break but handle it correctly. For instance, maybe the
...