monitor clone
Motivation
IT infrastructure comprises the hardware, software, services, and network resources that an organization needs to operate and manage its IT environment. Unplanned downtime can be costly. In October 2021, Meta’s apps were down for nearly nine hours, resulting in a loss of around $13 million an hour. Such losses emphasize the need to monitor IT infrastructure.
IT infrastructure is spread widely around the globe, with data centers connected through private or public networks, so monitoring servers in geographically separated data centers is essential. Outages are bound to happen, and systems can crash. Detecting a failure before it turns into an outage is pivotal, and monitoring makes that possible.
According to Amazon, on Dec 7, 2021, at 7:30 AM PST, an automated activity to scale the capacity of one of the AWS services hosted in the main AWS network triggered unexpected behavior from a large number of clients inside the internal network. This caused a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, delaying communication between the two networks. The outage reportedly cost Amazon $66,240 per minute.
Today’s services rely on complex infrastructure, and something is constantly failing. While fault-tolerant system designs hide most of these failures from end users, it is crucial to catch them early, before they snowball into a bigger problem.
Requirements
Let’s sum up what we want our monitoring system to do for us:
- Monitoring critical local processes on a server for crashes
- Monitoring any anomalies in a process's use of CPU, memory, disk, or network bandwidth on a server
- Monitoring overall server health (CPU, memory, disk, network bandwidth, average load, etc.)
- Monitoring hardware component faults on a server (like memory failures, failing or ...
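The server-health requirement in the list above can be illustrated with a minimal sketch. It is not part of the original lesson: the names `HealthSnapshot`, `collect_snapshot`, and `find_alerts`, as well as the threshold values, are hypothetical and chosen only for illustration. The collector assumes a Unix-like host and uses only the Python standard library.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical container for a few server-health readings."""
    load_per_cpu: float   # 1-minute load average divided by CPU count
    disk_used_pct: float  # percentage of the filesystem in use


def collect_snapshot(path: str = "/") -> HealthSnapshot:
    """Read load average and disk usage from the local machine (Unix-like only)."""
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    usage = shutil.disk_usage(path)
    return HealthSnapshot(
        load_per_cpu=load_1min / cpus,
        disk_used_pct=100.0 * usage.used / usage.total,
    )


def find_alerts(snapshot: HealthSnapshot,
                max_load_per_cpu: float = 1.0,
                max_disk_used_pct: float = 90.0) -> list[str]:
    """Compare a snapshot against illustrative thresholds and list any violations."""
    alerts = []
    if snapshot.load_per_cpu > max_load_per_cpu:
        alerts.append(f"high load: {snapshot.load_per_cpu:.2f} per CPU")
    if snapshot.disk_used_pct > max_disk_used_pct:
        alerts.append(f"disk almost full: {snapshot.disk_used_pct:.1f}% used")
    return alerts


if __name__ == "__main__":
    snapshot = collect_snapshot()
    for alert in find_alerts(snapshot):
        print(alert)
```

Separating collection from evaluation keeps the threshold logic testable without touching the host. A real monitoring system would typically ship such readings to a central service rather than evaluate them locally.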