Introduction

Writing code for a single node is fairly straightforward, but the moment we switch to writing code that runs on multiple computers connected by a network (a distributed system), the ways in which faults and failures can occur become numerous, nondeterministic, and unpredictable. For example:

  • Misconfiguration of network switches

  • Accidental power cycles

  • Power distribution unit (PDU) failures

  • Backbone failures for the entire datacenter

  • Power failure for the entire datacenter

Distributed systems also suffer from partial failures, where one part of the system fails while the rest keeps running. A distributed system may continue to work intermittently ...