Design a Client-Side Monitoring Service

Capture metrics, batch events, and minimize CPU. System Design Interview Fast-Track.

We'll cover the following...

Monitoring: metrics and alerting
Metrics
Alerting
Client-side errors
Failures due to a routing bug
Design of a client-side monitoring system
Conclusion

The modern economy depends on the continuous operation of IT infrastructure, which comprises hardware, distributed services, and network resources. Because these components are interlinked, keeping everything running smoothly without application downtime is challenging. Two types of monitoring systems capture different classes of errors: server-side and client-side monitoring systems. In this lesson, we will focus on the client-side monitoring system.

Client-side errors are errors whose root cause lies on the client side. When a request reaches the service, such errors are reported with 4xx HTTP status codes. Other client-side errors are invisible to the service because the client's requests never reach it.
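To make the distinction concrete, here is a minimal sketch of how a client might label the outcome of a single request. The function name `classify_outcome` and the convention that `None` means "no response arrived" are assumptions made for illustration; they do not come from any particular library.

```python
from typing import Optional


def classify_outcome(status: Optional[int]) -> str:
    """Map the outcome of one HTTP request to a coarse class.

    `status` is the HTTP status code, or None when no response arrived
    at all (a hypothetical convention used only in this sketch).
    """
    if status is None:
        # The request never reached the service, so the service
        # has no record of this failure.
        return "unreachable"
    if 400 <= status <= 499:
        return "client_error"
    if 500 <= status <= 599:
        return "server_error"
    return "no_error"
```

Only the client can observe the "unreachable" case, which is why it needs to be captured on the client side.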

Before designing the system, let’s look at metrics and alerting in a monitoring system.

Monitoring: metrics and alerting

A good monitoring system needs to clearly define what to measure and in what units (metrics). It also needs to define threshold values for all metrics and be able to inform the appropriate stakeholders (alerts) when values fall outside acceptable ranges. Knowing the state of our infrastructure and systems helps maintain service stability. The support team can respond to issues more quickly and confidently when it has access to information on the health and performance of the deployments. Monitoring systems that collect measurements, display data, and send warnings when something appears wrong give the team that information.
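The relationship between metrics, thresholds, and alerts can be sketched in a few lines. This is an illustrative example only: the `Threshold` class, the `find_alerts` function, and the metric names are hypothetical, and a real system would route the resulting messages to the appropriate stakeholders rather than return them as strings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one metric (hypothetical structure)."""
    low: float
    high: float


def find_alerts(readings: Dict[str, float],
                thresholds: Dict[str, Threshold]) -> List[str]:
    """Return one message per reading that is outside its acceptable range."""
    alerts = []
    for name, value in readings.items():
        limit = thresholds.get(name)
        if limit is None:
            continue  # no threshold defined for this metric
        if not (limit.low <= value <= limit.high):
            alerts.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
    return alerts


# Example usage with made-up values.
thresholds = {"cpu_percent": Threshold(0.0, 85.0)}
alerts = find_alerts({"cpu_percent": 97.0, "queue_depth": 3.0}, thresholds)
```

Here `queue_depth` produces no alert because no threshold was defined for it, which illustrates why thresholds should be defined for every metric that matters.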

To further understand metrics, alerts, and their connection with monitoring, we’ll review their significance, potential benefits, and the data we might want to keep track of.

Metrics

Metrics objectively define what we should measure and in what units. Metric values provide insight into the system at any point in time. For example, the amount of traffic a web server can handle per second, or whether it is included in a web server pool, is high-level data correlated with a component’s specific purpose or activity. Another example is measuring network performance in terms of throughput (megabits per second) and latency (round-trip time). We need to collect metric values with a minimal performance penalty. We may measure this penalty in terms of user-perceived latency or the amount of computational resources consumed.
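As a rough sketch of what "a value with a unit" could look like in code, the example below records a throughput sample in megabits per second and a latency sample in milliseconds. The `MetricSample` class and both helper functions are hypothetical names chosen for illustration, and `timed_ms` measures the duration of any callable as a stand-in for a real network round trip.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MetricSample:
    """One measurement with an explicit unit (hypothetical structure)."""
    name: str
    value: float
    unit: str
    timestamp: float  # seconds since the epoch


def throughput_mbps(bytes_transferred: int, seconds: float) -> MetricSample:
    """Convert a byte count over a time window to megabits per second."""
    megabits = bytes_transferred * 8 / 1_000_000
    return MetricSample("throughput", megabits / seconds, "Mbit/s", time.time())


def timed_ms(name: str, func: Callable[[], object]) -> MetricSample:
    """Run func once and report how long it took, in milliseconds."""
    start = time.perf_counter()
    func()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return MetricSample(name, elapsed_ms, "ms", time.time())


sample = throughput_mbps(1_250_000, 1.0)  # 1.25 MB in one second
latency = timed_ms("noop_latency", lambda: None)
```

The same timing helper can be wrapped around the collection code itself to estimate the performance penalty that measurement adds.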

Values that track how many physical resources our operating system uses are a good starting point. With a monitoring system in place, we don’t have to do much additional work to get data on processor load, CPU statistics such as cache hits and misses, RAM usage by the OS and processes, page faults, disk space, disk read and write latencies, swap space usage, and so on. Many web servers, database servers, and other software also expose metrics that help us determine whether everything is running smoothly.
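A few of these host-level values can be read with the Python standard library alone, as the sketch below shows. The function name `collect_host_metrics` and the dictionary keys are assumptions for illustration. The sketch covers only CPU count, load average, and disk space; values such as cache misses or page faults need platform-specific tools that are not shown here.

```python
import os
import shutil


def collect_host_metrics(path: str = ".") -> dict:
    """Collect a small set of host metrics using only the standard library."""
    usage = shutil.disk_usage(path)  # total, used, and free space in bytes
    metrics = {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_free_bytes": usage.free,
    }
    # os.getloadavg() is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            one_min, five_min, fifteen_min = os.getloadavg()
            metrics["load_avg_1m"] = one_min
            metrics["load_avg_5m"] = five_min
            metrics["load_avg_15m"] = fifteen_min
        except OSError:
            pass  # load average could not be obtained on this host
    return metrics


host_metrics = collect_host_metrics()
```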

Populate the metrics