Monitoring and Alerting with Prometheus

In this lesson, we will discuss why we need a proper monitoring system to carry out chaos experiments.

As I already mentioned, the critical ingredient that Chaos Toolkit does not provide is notification when a part of the system fails. Steady-state hypotheses are focused on what we know, and they are usually limited to a single application, network, storage, or node. By their nature, they are limited in scope.

As you already know, we do need a proper monitoring system. We need to gather the metrics, and we are already doing that through Prometheus. We need to be able to observe those metrics to deduce the state of everything in our cluster. We can do that through dashboards. But that often proves to be insufficient beyond the initial stages. We need a proper alerting system that we will improve over time. Whenever we detect something through dashboards, we probably want to convert that discovery into an alert. If we do that, we will not need to keep staring at the monitor filled with “pretty colors” forever.
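To make the idea of converting a dashboard observation into an alert more concrete, here is a minimal sketch in Python. It is not how Prometheus or AlertManager is implemented. It only imitates, under simplifying assumptions, the common pattern of an alert that fires when a metric stays above a threshold for some time. The class name ThresholdAlert, the alert name HighErrorRate, and the sample values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ThresholdAlert:
    """Hypothetical alert definition: fire when a metric stays above a threshold."""

    name: str
    threshold: float
    hold_seconds: float  # how long the breach must last before the alert fires

    def state(self, samples: Iterable[Tuple[float, float]]) -> str:
        """Return 'inactive', 'pending', or 'firing' as of the last sample.

        samples are (timestamp_seconds, value) pairs in ascending time order.
        """
        breach_start = None  # timestamp at which the current breach began
        last_ts = None
        for ts, value in samples:
            last_ts = ts
            if value > self.threshold:
                if breach_start is None:
                    breach_start = ts
            else:
                breach_start = None  # condition cleared, so reset the timer
        if breach_start is None:
            return "inactive"
        if last_ts - breach_start >= self.hold_seconds:
            return "firing"
        return "pending"


# Illustrative data: an error rate that climbs above 5% and stays there.
alert = ThresholdAlert(name="HighErrorRate", threshold=0.05, hold_seconds=60)
samples = [(0, 0.01), (30, 0.08), (60, 0.09), (90, 0.10)]
print(alert.name, alert.state(samples))
```

Requiring the condition to hold for a while before firing is a deliberate choice in this sketch. It keeps a short spike, which would be barely visible on a dashboard, from producing a notification.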

I prefer to use AlertManager. You might not have the same taste and choose to use something else. It could be, for example, DataDog, or anything else that suits your needs. Nevertheless, this is a course about chaos engineering. Even though monitoring is an essential part of it, it is still outside the scope of this course. So, I will not go into monitoring in more detail because that would require much more than a few paragraphs. A whole section would not even be able to scratch the surface. If you do feel you need more, you might want to check The DevOps 2.5 Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes.

You should be comfortable with monitoring and other related subjects. Without robust monitoring and alerting, there is no proper chaos engineering. Or, to be more precise, I don’t think that you should jump into chaos engineering without first mastering monitoring and alerting. So, figure out how to properly monitor, observe, and alert based on thresholds in your system. Only after that will you be able to do chaos experiments successfully.


In the next lesson, we will remove the resources that we have created.
