Design a Server-Side Monitoring Service

Learn to design a service to monitor server-side errors.

It’s challenging to know what’s happening at the hardware or application level when our infrastructure is distributed across multiple locations and includes many servers. Components can fail, response latency can overshoot, hardware can become overloaded or unreachable, and containers can run out of resources, among other issues. Many services run in such an infrastructure, and any of them can go awry.

When one service goes down, it can cause other services to crash, and as a result, the application becomes unavailable to users. If we don’t find out early what went wrong, debugging the system manually could take a lot of time and effort.

Monitoring helps us analyze such complex infrastructure, where something is constantly failing. Monitoring distributed systems entails gathering, interpreting, and displaying data about the interactions between processes that are running at the same time. It assists in debugging, testing, and performance evaluation, and it provides a bird’s-eye view of multiple services.

We will learn to design a monitoring service that focuses on server-side errors. These errors are usually visible to monitoring services because they occur on the servers themselves. Such errors are reported with HTTP 5xx response status codes.
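To make the idea of a server-side error concrete, here is a minimal Python sketch that classifies HTTP status codes and computes the share of 5xx responses in a batch. The function names and the sample list of status codes are hypothetical and used only for illustration; a real monitoring service would read these codes from logs or metrics rather than from a hard-coded list.

```python
from collections import Counter


def is_server_error(status_code: int) -> bool:
    """Return True for HTTP 5xx status codes (server-side errors)."""
    return 500 <= status_code <= 599


def server_error_rate(status_codes) -> float:
    """Return the fraction of responses that are 5xx (0.0 for empty input)."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return sum(is_server_error(code) for code in codes) / len(codes)


def server_error_counts(status_codes) -> Counter:
    """Count how often each distinct 5xx status code appears."""
    return Counter(code for code in status_codes if is_server_error(code))


# Hypothetical sample of response codes collected from one server.
sample = [200, 200, 503, 404, 500, 200, 503, 301]

print(server_error_rate(sample))    # 3 of 8 responses are 5xx
print(server_error_counts(sample))  # per-code breakdown of the 5xx responses
```

Note that 404 is not counted: 4xx codes indicate client-side errors, which fall outside the scope of the server-side focus described above.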


In a distributed system, why is a dedicated monitoring solution necessary instead of simply relying on individual server logs?


Requirements

Let's sum up what we want our monitoring system to do for us:

  • Server monitoring: It includes monitoring critical local processes on a server for crashes and detecting any anomalies in CPU, memory, disk, or network usage by server processes. Additionally, overall server health is monitored, encompassing metrics like CPU, memory, disk, network bandwidth, and average load.

  • Hardware monitoring: It involves monitoring hardware component faults on a server, such as memory failures, failing or slowing disk, and so on.

  • Datacenter infrastructure monitoring: It involves monitoring all network switches, load balancers, and other specialized hardware within the datacenter. Additionally, monitoring power ...