Alerting on Saturation-related Issues

In this lesson, we will discuss the issues related to the Saturation Key Metric.

Measure saturation #

Saturation measures the fullness of our services and the system. We should be aware if replicas of our services are processing too many requests and being forced to queue some of them. We should also monitor whether usage of our CPUs, memory, disks and other resources reaches critical limits.

Measure CPU usage #

For now, we’ll focus on CPU usage. We’ll start by opening Prometheus' graph screen.

open "http://$PROM_ADDR/graph"

Let’s see if we can get the rate of used CPU by node (instance). We can use the node_cpu_seconds_total metric for that. However, it is split into different modes, and we’ll have to exclude a few of them to get the “real” CPU usage. Those will be idle, iowait, and any type of guest cycles.

Please type the expression that follows, and press the Execute button.

sum(rate(
  node_cpu_seconds_total{
    mode!="idle", 
    mode!="iowait", 
    mode!~"^(?:guest.*)$"
  }[5m]
)) 
by (instance)

Switch to the Graph view.

The output represents the actual usage of CPU in the system, expressed as CPU seconds consumed per second. In my case (screenshot below), excluding a temporary spike, all nodes are using less than a hundred CPU milliseconds per second, which is less than a tenth of a single CPU. The system is far from being under stress.

Prometheus' graph screen with the rate of used CPU grouped by node instances
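
As a side note, the three mode matchers in the expression above can probably be collapsed into one. The sketch below assumes that Prometheus anchors label regular expressions on both ends (so `guest.*` covers both guest and guest_nice) and that your node exporter reports the same mode names as mine. It should return the same series as the longer form.

```promql
# Used CPU (CPU seconds per second) by node,
# excluding idle, iowait, and every guest mode with a single negative regex matcher.
sum(rate(
  node_cpu_seconds_total{
    mode!~"idle|iowait|guest.*"
  }[5m]
))
by (instance)
```

Which form to use is mostly a matter of taste. The separate matchers are easier to read, while the single regex is shorter to type.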

Percentage of used CPU #

As you have already noticed, absolute numbers are rarely useful. We should try to discover the percentage of used CPU. To do that, we’ll need to find out how many CPUs our nodes have. We can get that number by counting time series. Each CPU gets its own data entry for each mode, so if we limit the result to a single mode (e.g., system), we should be able to get the total number of CPUs.

Please type the expression that follows, and press the Execute button.

count(
  node_cpu_seconds_total{
    mode="system"
  }
)

In my case (screenshot below), there are six cores in total. Yours is likely to be six as well if you’re using GKE, EKS, or AKS from the Gists. If, on the other hand, you’re running the cluster in Docker For Desktop ...
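
Once the number of CPUs is known, the two expressions we used so far can be combined into a ratio. What follows is only a sketch, not necessarily the expression the lesson arrives at. It assumes both sides are grouped by the instance label so that Prometheus can match them one-to-one, and the multiplication by 100 is my addition to turn the ratio into a percentage.

```promql
# Used CPU by node divided by the number of CPUs on that node.
# Both sides are grouped by instance, so each node is divided by its own CPU count.
sum(rate(
  node_cpu_seconds_total{
    mode!="idle",
    mode!="iowait",
    mode!~"^(?:guest.*)$"
  }[5m]
))
by (instance) /
count(
  node_cpu_seconds_total{
    mode="system"
  }
)
by (instance)
* 100
```

A value close to 100 for a node would mean that nearly all of its CPUs were busy during the last five minutes.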
