Alerting on Error-related Issues

In this lesson, we will discuss the issues related to the Error Key Metric.

Monitor the rate of errors compared to the total number of requests

We should always know whether our applications or the system are producing errors. However, we cannot start panicking at the first occurrence of an error, since that would generate so many notifications that we’d likely end up ignoring them. Errors happen often, and many are caused by issues that are fixed automatically or are due to circumstances outside our control. If we were to act on every error, we’d need an army of people working 24/7 only on fixing issues that often do not need to be fixed. As an example, entering “panic” mode because of a single response with a code in the 500 range would almost certainly produce a permanent crisis. Instead, we should monitor the rate of errors compared to the total number of requests and react only if it passes a certain threshold. After all, if an error persists, that rate will undoubtedly increase. On the other hand, if it stays low, it means that the issue was fixed automatically by the system ...
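To make the idea of alerting on a rate rather than on individual errors concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `RequestCounts` type, the function names, and the 2.5% default threshold are hypothetical and do not come from the lesson, and in practice the counts would come from whatever metrics system is in use.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Request totals observed over one evaluation window (hypothetical shape)."""
    total: int   # all requests seen in the window
    errors: int  # responses with a status code in the 500 range


def error_rate(counts: RequestCounts) -> float:
    """Return errors as a fraction of all requests; 0.0 when there was no traffic."""
    if counts.total == 0:
        return 0.0
    return counts.errors / counts.total


def should_alert(counts: RequestCounts, threshold: float = 0.025) -> bool:
    """Alert only when the error rate exceeds the threshold.

    The default threshold of 2.5% is an assumption chosen for illustration.
    """
    return error_rate(counts) > threshold


if __name__ == "__main__":
    # A single failed request among many does not trigger an alert.
    print(should_alert(RequestCounts(total=10_000, errors=1)))
    # A persistent problem pushes the rate past the threshold.
    print(should_alert(RequestCounts(total=1_000, errors=50)))
```

With these assumed numbers, one error in 10,000 requests stays well below the threshold, while 50 errors in 1,000 requests (5%) exceeds it. The right threshold depends on the application and on how much background noise its error rate normally shows.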