Design of a Monitoring System

Learn about the initial design of a generic monitoring system.

Requirements

Let’s sum up what we want our monitoring system to do for us:

  • Monitor critical local processes on a server for crashes.

  • Monitor any anomalies in the use of CPU/memory/disk/network bandwidth by a process on a server.

  • Monitor overall server health, such as CPU, memory, disk, network bandwidth, and average load.

  • Monitor hardware component faults on a server, such as memory failures and failing or slowing disks.

  • Monitor the server’s ability to reach critical services outside the server, such as network file systems.

  • Monitor all network switches, load balancers, and any other specialized hardware inside a data center.

  • Monitor power consumption at the server, rack, and data center levels.

  • Monitor any power events on the servers, racks, and data center.

  • Monitor routing information and DNS for external clients.

  • Monitor the latency of network links and paths within and across data centers.

  • Monitor network status at the peering points.

  • Monitor overall service health that might span multiple data centers—for example, a CDN and its performance.
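
To make the server-level requirements above more concrete, here is a minimal sketch of a local health check in Python. It is an illustration under stated assumptions, not the design this lesson arrives at: the metric names, the `LIMITS` values, and the helper functions `collect_server_health` and `find_anomalies` are all hypothetical, and only two metrics (disk usage and load average) are collected.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    metric: str
    value: float
    limit: float


def collect_server_health(path: str = "/") -> dict[str, float]:
    """Gather a few local health metrics using only the standard library."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_used_percent": 100.0 * usage.used / usage.total}
    # os.getloadavg() exists only on Unix-like systems.
    if hasattr(os, "getloadavg"):
        metrics["load_avg_1m"] = os.getloadavg()[0]
    return metrics


def find_anomalies(metrics: dict[str, float], limits: dict[str, float]) -> list[Alert]:
    """Return an Alert for every metric that exceeds its configured limit."""
    return [
        Alert(name, value, limits[name])
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    ]


# Hypothetical limits; real values depend on the hardware and workload.
LIMITS = {"disk_used_percent": 90.0, "load_avg_1m": 8.0}

if __name__ == "__main__":
    for alert in find_anomalies(collect_server_health(), LIMITS):
        # A real system would forward this to the alert manager instead of printing.
        print(f"ALERT {alert.metric}={alert.value:.1f} exceeds limit {alert.limit:.1f}")
```

Keeping collection separate from the threshold check means the same check could be applied to metrics gathered from other sources, such as switches or power equipment. Static thresholds are only the simplest way to flag an anomaly.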

We want automated monitoring that detects anomalies in the system and either notifies the alert manager or shows the current status on a dashboard. Cloud service providers, for example, publish the health status of their services on a dashboard.
