Alerting on Unschedulable or Failed Pods

In this lesson, we will see how to set up alerts for unschedulable or failed Pods.

Cause of unschedulable or failed Pods

Knowing whether our applications are having trouble responding quickly to requests, whether they are being bombarded with more requests than they can handle, whether they produce too many errors, and whether they are saturated is of no use if they are not even running. Even if our alerts detect that something is wrong by notifying us that there are too many errors, or that response times are slow due to an insufficient number of replicas, we should still be informed if, for example, one or even all of the replicas failed to run. In the best-case scenario, such a notification would provide additional information about the cause of an issue.

In a much worse situation, we might find out that one of the replicas of the DB is not running. That would not necessarily slow it down, nor would it produce any errors. However, it would put us in a situation where data cannot be replicated (the additional replicas are not running), and we might face a total loss of its state if the last standing replica fails as well.

There are many reasons why an application would fail to run. There might not be enough unreserved resources in the cluster. Cluster Autoscaler, if we have it, will deal with that problem. But there are many other potential issues. Maybe the image of the new release is not available in the registry. Or perhaps the Pods are requesting PersistentVolumes that cannot be claimed. As you might have guessed, the list of things that might cause our Pods to fail, be unschedulable, or end up in an unknown state is almost infinite.
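
To make the three troublesome states a bit more concrete, here is a minimal Python sketch that sorts Pods into "Failed", "Unschedulable", "Pending", or "Unknown". It is not part of the original lesson. It assumes Pod objects shaped like the JSON returned by the Kubernetes API (a `status.phase` field plus a `PodScheduled` condition whose reason is `Unschedulable` when the scheduler cannot place the Pod). The function names `classify` and `summarize`, and the sample Pods, are hypothetical and exist only for illustration.

```python
from collections import Counter

# Of the Kubernetes Pod phases (Pending, Running, Succeeded, Failed, Unknown),
# this sketch treats only these two as healthy. That choice is an assumption.
HEALTHY_PHASES = {"Running", "Succeeded"}


def classify(pod):
    """Return a label describing why a Pod deserves attention, or None."""
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    if phase in HEALTHY_PHASES:
        return None
    if phase == "Pending":
        # An unschedulable Pod stays Pending and carries a
        # PodScheduled=False condition with reason "Unschedulable".
        for cond in status.get("conditions", []):
            if (cond.get("type") == "PodScheduled"
                    and cond.get("status") == "False"
                    and cond.get("reason") == "Unschedulable"):
                return "Unschedulable"
    return phase  # "Pending", "Failed", or "Unknown"


def summarize(pods):
    """Count problematic Pods per label."""
    return Counter(label for label in map(classify, pods) if label)


# Hypothetical sample data, hand-written for illustration only.
pods = [
    {"metadata": {"name": "api-1"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "api-2"}, "status": {"phase": "Failed"}},
    {"metadata": {"name": "db-1"}, "status": {
        "phase": "Pending",
        "conditions": [{"type": "PodScheduled", "status": "False",
                        "reason": "Unschedulable"}]}},
    {"metadata": {"name": "db-2"}, "status": {"phase": "Unknown"}},
]

print(dict(summarize(pods)))
```

In a real cluster, the same information would more likely come from metrics than from a script like this one. The sketch only shows that "failed", "unschedulable", and "unknown" are distinct conditions, and that none of them would show up as a latency or error-rate problem.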

Generating an alert after a while

We ...