
Facebook: Optimized Datacenter Resource Allowance System

Introduction

A cluster manager runs on a set of nodes and manages a cluster. Together with cluster agents, it handles the complete cluster, including placing and managing containers or virtual machines on servers. Efficiently allocating resources in data centers has been a challenging task for cluster managers, studied extensively over the past decades.

Various cluster-management systems have been adopted for this purpose, including open-source systems such as Kubernetes and proprietary systems such as Google’s Borg, Facebook’s Twine, and Microsoft’s Protean.

Capacity reservation allows us to reserve compute instances in advance so that they are available during critical events such as unscheduled maintenance, disaster recovery, or unexpected workload surges.

However, existing approaches offer little guidance on how to provide guaranteed capacity despite large-scale failures in data centers.

In this lesson, we describe how Facebook solved this problem for their on-premise infrastructure.

Challenges in providing guaranteed capacity

There are numerous challenges involved in providing guaranteed capacity. First, a cluster manager needs to account for both independent and correlated failures across various components, including clusters, servers, racks, network switches, power rows, and cooling systems. Simply increasing the buffer capacity to handle every potential failure is expensive.
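To make the cost concrete, here is a minimal back-of-the-envelope sketch in Python (with hypothetical numbers; this is not RAS's actual sizing logic) of how much extra capacity a guarantee requires once correlated and random failures are both budgeted for:

```python
# Hypothetical buffer-sizing sketch: how many servers must be provisioned so
# that a guarantee survives the loss of one correlated failure domain (e.g.,
# an MSB) plus ordinary independent server failures. Numbers are illustrative.

def provisioned_servers(guarantee: int, num_fault_domains: int,
                        random_failure_rate: float) -> int:
    """Provisioned = guarantee + largest correlated loss + random-failure buffer."""
    # With an even spread, losing one of `num_fault_domains` domains costs
    # roughly guarantee / num_fault_domains servers.
    correlated_buffer = guarantee // num_fault_domains
    random_buffer = int(guarantee * random_failure_rate)
    return guarantee + correlated_buffer + random_buffer

# A workload guaranteed 10,000 servers, spread across 4 fault domains, with
# ~2% of servers down at any time due to independent failures:
print(provisioned_servers(10_000, 4, 0.02))  # -> 12700, i.e., a 27% buffer
```

Note how the correlated-failure buffer shrinks as the workload is spread over more fault domains, which is exactly why the quality of the spread discussed below matters.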

Second, the cluster manager still needs to sustain the capacity guarantee despite ongoing infrastructure management events such as OS kernel upgrades, software updates, and hardware refreshes. Since each of these events can cause a different extent of server capacity loss, the cluster manager needs to promptly bring in replacement servers.

Third, workloads differ in nature, and a cluster may contain various kinds of hardware, leading to hardware heterogeneity. A cluster manager should therefore provide capacity that satisfies workload constraints despite this heterogeneity.

Lastly, there is an inherent tradeoff between the quality and the speed of resource allocation: if we optimize for speed, we might not be able to provide guarantees against large-scale failures. For example, to provide fast container allocation, we might end up with an unbalanced spread of a service across MSBs (Main Switch Boards). As a result, a single MSB failure could be catastrophic for the reliability of that service.
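The following sketch illustrates this tradeoff with hypothetical data and two deliberately simplified allocators (this is not Twine's placement code): a greedy allocator is fast but can concentrate a service in one MSB, while a balanced allocator bounds the damage from a single MSB failure.

```python
# Speed-vs-quality tradeoff sketch with made-up MSB capacities.
from collections import Counter

msb_free = {"msb-1": 100, "msb-2": 100, "msb-3": 100}

def place_greedy(n: int) -> Counter:
    """Fast: always take the first MSB that still has room."""
    placed = Counter()
    for _ in range(n):
        msb = next(m for m, free in msb_free.items() if free > placed[m])
        placed[msb] += 1
    return placed

def place_balanced(n: int) -> Counter:
    """Slower to decide, but bounds the loss from any single MSB failure."""
    placed = Counter({m: 0 for m in msb_free})
    for _ in range(n):
        msb = min(placed, key=placed.get)  # least-loaded MSB so far
        placed[msb] += 1
    return placed

print(place_greedy(90))    # Counter({'msb-1': 90}) -> one MSB failure loses the whole service
print(place_balanced(90))  # 30 containers per MSB  -> one MSB failure loses about a third
```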

Prior solutions

The most common approach to assigning servers to clusters is based on static scopes. For example, all servers in a data center may belong to one cluster. Servers may be added to or removed from a cluster, but these changes are often initiated manually. The advantage of this method is that it reduces the set of candidate servers to be evaluated on the critical path of container placement. Hence, new containers can be deployed within a few seconds by using existing servers within a cluster.

However, this approach has some drawbacks. First, static assignment of servers to clusters lets some clusters run out of capacity while others sit underutilized. Second, the allocation of servers may be suboptimal due to variation in the power and network consumption of workloads and their differing hardware requirements. Finally, service owners have to prepare for data center-scale failures individually.

The approach previously used by Facebook is a shared mega server pool that consists of all servers from the data centers in a geographical region, connected via a low-latency network. Twine arranges servers into logical clusters called entitlements. When a new container needs to be placed but cannot fit on any existing server in an entitlement, a free server is greedily taken from a shared region-level free-server pool and added to the entitlement to host the new container. The server is returned to the shared free-server pool when its last container is decommissioned. On the one hand, a single server pool removes the capacity stranded in many smaller physical clusters. On the other hand, this approach performs a whole region’s server-to-entitlement assignment on the critical path of container placement. As a result, Facebook had to adopt simple heuristics to allow quick server-assignment decisions, which could lead to suboptimal server assignment and could not provide guaranteed capacity in the event of correlated failures. In summary, both approaches are efficient but have their limitations. Ideally, a cluster manager should combine their advantages without their limitations.
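The entitlement flow described above can be summarized with a small sketch (our own simplification in Python; the class and method names are hypothetical, not Twine's API):

```python
# Simplified entitlement flow: servers are pulled from a region-wide free pool
# on the container-placement critical path and drain back when idle.

class Entitlement:
    def __init__(self) -> None:
        self.servers: dict[str, set[str]] = {}  # server -> containers running on it

    def place(self, container: str, free_pool: list[str],
              capacity_per_server: int = 10) -> str:
        # Try existing servers in the entitlement first.
        for server, containers in self.servers.items():
            if len(containers) < capacity_per_server:
                containers.add(container)
                return server
        # Otherwise grab a free server from the shared region-level pool.
        # This step sits on the critical path of container placement.
        if not free_pool:
            raise RuntimeError("region free pool exhausted")
        server = free_pool.pop()
        self.servers[server] = {container}
        return server

    def remove(self, container: str, free_pool: list[str]) -> None:
        for server, containers in list(self.servers.items()):
            if container in containers:
                containers.discard(container)
                if not containers:            # last container gone:
                    del self.servers[server]  # return the server to the pool
                    free_pool.append(server)
                return

pool = ["srv-1", "srv-2"]
ent = Entitlement()
ent.place("web-0", pool)  # pulls a server from the shared pool on demand
```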

RAS solution by Facebook

This lesson describes Twine’s new server-allocation component, called Resource Allowance System (RAS). RAS dynamically assigns servers to a logical cluster called a reservation. A reservation provides its workloads with a certain amount of guaranteed capacity that accounts for random and correlated failures, maintenance events, heterogeneous hardware resources, and complex workload requirements and characteristics.

RAS breaks resource allocation into the following two levels (illustrated in the sketch after the list).

  1. Assignment of servers to reservations off the critical path.
  2. Placement of containers to servers within each reservation.
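Below is a minimal sketch of this two-level split, under a simplified model of a reservation (all names and sizes are hypothetical, not the RAS API). The first function runs off the critical path and keeps a reservation topped up to its guaranteed size; the second places containers only among servers already in the reservation.

```python
# Two-level allocation sketch with a hypothetical Reservation model.

class Reservation:
    def __init__(self, guaranteed_servers: int) -> None:
        self.guaranteed_servers = guaranteed_servers
        self.servers: list[str] = []

# Level 1: slow loop, run asynchronously (e.g., every few minutes), so its
# cost never delays container placement.
def maintain_reservation(res: Reservation, region_free_pool: list[str]) -> None:
    while len(res.servers) < res.guaranteed_servers and region_free_pool:
        res.servers.append(region_free_pool.pop())

# Level 2: fast path, only considers servers already assigned to the reservation.
def place_container(res: Reservation, container: str,
                    load: dict[str, int], capacity: int = 10) -> str:
    if not res.servers:
        raise RuntimeError("reservation has no servers yet")
    server = min(res.servers, key=lambda s: load.get(s, 0))
    if load.get(server, 0) >= capacity:
        raise RuntimeError("reservation out of capacity")
    load[server] = load.get(server, 0) + 1
    return server
```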

Through this approach, server-assignment constraints are removed from the latency-sensitive container-placement process; instead, they are evaluated at reservation-creation time and maintained continuously.

Furthermore, the two-level approach treats each reservation as a separate cluster, enabling multiple container allocators to run independently for better scalability. Finally, each reservation incorporates the buffer capacity required for handling large-scale failures and maintenance, removing server-to-reservation assignment from the critical path of these operations.

RAS has several benefits over the previous solutions. First, it eliminates the drawbacks of statically scoped clusters, including the capacity stranded in clusters and the burden on service owners to prepare for large-scale failures individually. RAS resolves these issues by dynamically allocating servers to reservations based on workload characteristics and underlying infrastructure changes, and by embedding and optimizing failure and maintenance buffers as part of reservations. Second, RAS eliminates the limitation of Twine’s previous approach of allocating servers on the critical path of container placement by assigning a reservation’s full capacity ahead of time, so container placement can instantly use a free server already in the reservation. Finally, RAS provides the simple abstraction of workloads running on a reservation that offers guaranteed capacity and supports stacking, while handling random and correlated failures, data center maintenance, heterogeneous hardware, and other data center constraints and realities.

Resource management realities

Providing guaranteed capacity within a region raises various resource-allocation challenges due to the scale of the capacity involved, the complexities of data centers, and varying workload characteristics.

Region layout

Facebook operates in many regions around the globe. The following figure shows the organization of a region. Each region consists of several data center buildings connected via a high-bandwidth, low-latency network. As shown in the figure below, each data center building is composed of failure domains called Main Switch Boards (MSBs), which are designed to fail independently. An MSB is composed of tens of thousands of servers.
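As a simple illustration of this layout, the hierarchy can be modeled with a few data classes (illustrative only; the names and any counts are placeholders, not Facebook's internal schema):

```python
# Region -> data center buildings -> MSB fault domains -> servers.
from dataclasses import dataclass, field

@dataclass
class MSB:                       # Main Switch Board: an independent fault domain
    name: str
    servers: list[str] = field(default_factory=list)

@dataclass
class Datacenter:
    name: str
    msbs: list[MSB] = field(default_factory=list)

@dataclass
class Region:
    name: str
    datacenters: list[Datacenter] = field(default_factory=list)

    def all_servers(self) -> list[str]:
        """Resource management can aggregate servers region-wide, across buildings."""
        return [s for dc in self.datacenters for msb in dc.msbs for s in msb.servers]
```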

The network latency within a region is less than a millisecond. This enables resource management to aggregate resources across the data centers in a region, going beyond the traditional concept of a physical cluster, which is historically confined to a single data center. Still, the network bandwidth across data centers in a region is only a fraction of the bandwidth between racks in the same data center. As a ...
