Measure saturation

Saturation measures the fullness of our services and the system. We should be aware if replicas of our services are processing too many requests and being forced to queue some of them. We should also monitor whether usage of our CPUs, memory, disks and other resources reaches critical limits.

Measure CPU usage

For now, we’ll focus on CPU usage. We’ll start by opening Prometheus's graph screen.

open "http://$PROM_ADDR/graph"

Let’s see if we can get the rate of used CPU by node (instance). We can use the node_cpu_seconds_total metric for that. However, it is split into different modes, and we’ll have to exclude a few of them to get the “real” CPU usage. Those are idle, iowait, and any type of guest cycles.

Please type the expression that follows, and press the Execute button.

sum(rate(
  node_cpu_seconds_total{
    mode!="idle", 
    mode!="iowait", 
    mode!~"^(?:guest.*)$"
  }[5m]
)) 
by (instance)

Switch to the Graph view.

The output represents the actual usage of CPU in the system. In my case (screenshot below), excluding a temporary spike, all nodes are using less than a hundred CPU milliseconds per second (a value below 0.1 on the graph). The system is far from being under stress.
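To make it clearer what the expression computes, here is a minimal Python sketch of the same steps: take the per-second increase of each counter, drop the excluded modes, and sum what is left by instance. The node names and counter readings are made up for illustration, and the sketch ignores details that Prometheus's rate() handles for us, such as counter resets and extrapolation to the window edges.

```python
import re
from collections import defaultdict

WINDOW_SECONDS = 300  # same width as the [5m] range in the query

# Hypothetical counter readings:
# (instance, cpu, mode) -> (value at window start, value at window end).
samples = {
    ("node-1", "0", "user"): (1000.0, 1012.0),
    ("node-1", "0", "system"): (400.0, 403.0),
    ("node-1", "0", "idle"): (9000.0, 9285.0),
    ("node-2", "0", "user"): (500.0, 506.0),
    ("node-2", "0", "iowait"): (50.0, 53.0),
    ("node-2", "0", "guest_nice"): (0.0, 0.0),
}

# Modes excluded by the query; fullmatch mirrors the anchored regex matcher.
EXCLUDED_MODES = re.compile(r"idle|iowait|guest.*")


def used_cpu_by_instance(samples, window_seconds):
    """Sum the per-second increase of every non-excluded series, grouped by instance."""
    totals = defaultdict(float)
    for (instance, _cpu, mode), (start, end) in samples.items():
        if EXCLUDED_MODES.fullmatch(mode):
            continue
        totals[instance] += (end - start) / window_seconds
    return dict(totals)


usage = used_cpu_by_instance(samples, WINDOW_SECONDS)
for instance, cpu_seconds_per_second in sorted(usage.items()):
    # One CPU second per second is one fully used core, so multiply by 1000 for milliseconds.
    print(f"{instance}: {cpu_seconds_per_second * 1000:.0f} CPU milliseconds per second")
```

With these made-up readings, both nodes stay well below a hundred CPU milliseconds per second, which is the kind of result described above.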

Percentage of used CPU

As you already noticed, absolute numbers are rarely useful. We should try to discover the percentage of used CPU. To do that, we’ll need to find out how many CPUs our nodes have. We can do that by counting the time series of the metric. Each CPU gets its own data entry for each mode, so if we limit the result to a single mode (e.g., system), we are left with one entry per CPU, and counting them gives us the total number of CPUs.

Please type the expression that follows, and press the Execute button.

count(
  node_cpu_seconds_total{
    mode="system"
  }
)

In my case (screenshot below), there are six cores in total. Yours is likely to be six as well if you’re using GKE, EKS, or AKS from the Gists. If, on the other hand, you’re running the cluster in Docker For Desktop or minikube, the result should reflect the CPUs of a single node.
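The counting trick can be sketched in a few lines of Python. The label sets below are hypothetical (three nodes with two CPUs each, chosen only so that the total comes out as six), and the list of modes is shortened for brevity.

```python
from collections import Counter

# Hypothetical label sets (instance, cpu, mode): three nodes with two CPUs each.
series = [
    (f"node-{node}", str(cpu), mode)
    for node in (1, 2, 3)
    for cpu in (0, 1)
    for mode in ("idle", "system", "user")
]

# Keeping a single mode leaves exactly one series per CPU.
system_series = [s for s in series if s[2] == "system"]

total_cpus = len(system_series)
cpus_per_instance = Counter(instance for instance, _cpu, _mode in system_series)

print(total_cpus)
print(dict(cpus_per_instance))
```

Without the mode filter we would count every CPU once per mode (18 entries here instead of 6), which is why the query limits the result to a single mode before counting.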

Now we can combine the two queries to get the percentage of ...
