Overview

This lesson introduces how we will improve our application and experiment with its availability.

So far, we have seen how to define chaos experiments that terminate our instances to validate whether our application is fault-tolerant.

Fault tolerance is necessary, but for most of us it is not sufficient. Typically, we don’t want our applications to merely spin up again after something happens to them if that means seconds or even minutes of downtime. Failed Pods will indeed be re-created when they are managed by higher-level constructs like Deployments and StatefulSets. However, the application might be unavailable between the moment a Pod is destroyed and the moment its replacement is ready to serve requests, and that downtime is what we want to avoid.

In this section, we’ll try to define chaos experiments that validate whether the demo application, the same one we’ve been using, is highly available. Depending on what those experiments reveal, we might need to change its definition to improve it.

Let’s get going and see how that works for us.


In the next lesson, we will get the gist that contains all the commands for this chapter.
