Fault Tolerance and Fault Injection

Explore fault tolerance and fault injection in cloud application development.

Fault tolerance, that is, the ability of an application, platform, or runtime to tolerate a systemic fault, seems simple enough to grasp. After all, we’d expect an application to recover gracefully when certain services are unavailable. In many cases, though, applications have been written on the assumption that the infrastructure hosting them is always available unless a catastrophic event occurs. While this reliability might be built into on-premises data centers and rarely challenged, the same assumption doesn’t hold for cloud platforms, services, and components.

Though cloud platforms offer Service-Level Agreements (SLAs) for uptime on some cloud services, a service-level or region-level outage can still occur with no warning and vary widely in impact. It’s therefore important to keep these types of outages in mind when we’re developing an application for the cloud.
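
To make "keeping outages in mind" more concrete, here is a minimal Python sketch of one common approach: retry a failing dependency a bounded number of times, then fall back to a degraded result instead of crashing. Every name in it (ServiceUnavailable, call_with_fallback, fetch_recommendations, default_recommendations) is hypothetical and chosen for illustration; none comes from a particular cloud SDK, and a real application would likely catch the specific errors its client library raises.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error a dependency raises when it cannot serve a request."""


def call_with_fallback(primary, fallback, retries=3, base_delay=0.1, sleep=time.sleep):
    """Call primary() up to `retries` times; if every attempt fails, return fallback()."""
    for attempt in range(retries):
        try:
            return primary()
        except ServiceUnavailable:
            # Back off exponentially between attempts (no wait after the last one).
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    # The dependency is still down: degrade gracefully instead of failing.
    return fallback()


def fetch_recommendations():
    # Stand-in for a remote call; here it simulates an outage.
    raise ServiceUnavailable("recommendation service is down")


def default_recommendations():
    # Static content the application can serve without the remote service.
    return ["bestsellers"]


result = call_with_fallback(fetch_recommendations, default_recommendations)
```

In this sketch the application still returns something useful during the simulated outage. The retry count and delays are arbitrary assumptions and would need tuning for a real workload.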
