Monitor the rate of errors compared to the total number of requests #

We should always be aware of whether our applications or the system is producing errors. However, we cannot start panicking at the first occurrence of an error since that would generate too many notifications that we’d likely end up ignoring. Errors happen often, and many are caused by issues that are fixed automatically or are due to circumstances that are out of our control. If we are to perform an action on every error, we’d need an army of people working 24/7 only on fixing issues that often do not need to be fixed. As an example, entering into a “panic” mode because there is a single response with code in 500 range would almost certainly produce a permanent crisis. Instead, we should monitor the rate of errors compared to the total number of requests and react only if it passes a certain threshold. After all, if an error persists, that rate will undoubtedly increase. On the other hand, if it continues being low, it means that the issue was fixed automatically by the system ...

Before Getting Started

Autoscaling Deployments and StatefulSets

Auto-Scaling Nodes Of A Kubernetes Cluster

Collecting and Querying Metrics and Sending Alerts

Debugging Issues Discovered Through Metrics and Alerts

Extending HorizontalPodAutoscaler With Custom Metrics

Visualizing Metrics And Alerts

Collecting And Querying Logs

Conclusion

Alerting on Error-related Issues

Monitor the rate of errors compared to the total number of requests #