Lesson-01: Resource Management

Introduction

A cluster manager runs on a set of nodes and manages the cluster. It works with cluster agents to place and manage containers or virtual machines on servers across the whole cluster. The challenging task for a cluster manager is allocating data center resources efficiently. Capacity reservation lets us reserve compute instances in advance so that they are available during critical events such as unscheduled maintenance, disaster recovery, or the addition of unusual workloads.

Recent approaches cannot dynamically guarantee capacity during critical events, especially large-scale failures.

This series of lessons describes how Facebook solved this problem for its on-premises infrastructure by introducing a novel system. We will study the architecture of that system in detail in upcoming lessons.

Challenges in providing guaranteed capacity

Providing guaranteed capacity involves several challenges, listed below.

  1. The system must account for both independent and correlated failures across the components of the data center. Simply increasing the stand-by capacity to cover every potential failure is prohibitively expensive.

  2. The server manager should acquire replacement servers during normal infrastructure lifecycle events such as OS kernel upgrades, software updates, hardware refreshes, and other physical maintenance to avoid server capacity loss ...
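
To make the first challenge concrete, here is a minimal sketch of why correlated failures dominate the size of a stand-by buffer. Everything in it is an illustrative assumption rather than part of the system these lessons describe: the `FaultDomain` type, the function names, the server counts, and the failure rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultDomain:
    """Hypothetical group of servers that can fail together (e.g. one data center)."""
    name: str
    servers: int


def buffer_for_independent_failures(domains, failures_per_thousand):
    """Stand-by servers needed if servers fail one by one at a fixed rate.

    Uses integer ceiling division so the result is always a whole server count.
    """
    total = sum(d.servers for d in domains)
    return (total * failures_per_thousand + 999) // 1000


def buffer_for_correlated_failure(domains):
    """Stand-by servers needed to survive the loss of the largest fault domain."""
    return max(d.servers for d in domains)


# Illustrative numbers only: three fault domains, 10 in 1,000 servers down at once.
domains = [
    FaultDomain("dc-a", 4000),
    FaultDomain("dc-b", 3000),
    FaultDomain("dc-c", 3000),
]
independent = buffer_for_independent_failures(domains, failures_per_thousand=10)
correlated = buffer_for_correlated_failure(domains)

print(f"buffer for independent failures: {independent} servers")
print(f"buffer for one correlated failure: {correlated} servers")
```

Under these assumed numbers, covering a single correlated failure needs a buffer many times larger than covering independent failures, which is why statically reserving enough stand-by capacity for every scenario becomes so costly.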
