Examples of Resource Estimation
Try your hand at some back-of-the-envelope numbers.
Introduction
Now that we’ve laid the foundation for resource estimation, let’s use the knowledge gained in the previous lesson to estimate resources like servers, storage, and bandwidth. Below, we consider a Twitter-like service, make assumptions, and derive estimates from those assumptions. Let’s jump right in!
Number of servers required
Let’s make the following assumptions about a Twitter-like service.
Assumptions:
There are 500 million (M) daily active users (DAU).
A single user makes 20 requests per day on average.
Assume that a single server (with 64 cores) can handle 64,000 requests per second (RPS).
Estimating the Number of Servers
| Quantity | Value | Unit |
|---|---|---|
| Daily active users (DAU) | 500 | Million |
| Requests on average / user / day | 20 | |
| Total requests / day | 10 | Billion |
| Total requests / second | ~115 | K |
| Total servers required | 2 | |
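The numbers in the table can be reproduced with a few lines of arithmetic (a sketch using the assumptions above; the variable names are illustrative):

```python
import math

# Assumptions from above
DAU = 500_000_000            # daily active users
REQUESTS_PER_USER = 20       # average requests per user per day
RPS_PER_SERVER = 64_000      # requests/second one 64-core server can handle
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = DAU * REQUESTS_PER_USER                 # 10 billion
requests_per_second = requests_per_day / SECONDS_PER_DAY   # ~115,741, i.e., ~115K
servers_required = math.ceil(requests_per_second / RPS_PER_SERVER)  # 2

print(f"{requests_per_second:,.0f} RPS -> {servers_required} servers")
```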
Can you identify a hidden assumption in our calculations above?
Plausibility test: For all BOTECs, we need to judge whether our numbers seem reasonable. For example, if the estimate we obtained was two servers for a large service with millions of DAUs, that number can serve as a lower bound but seems far from reality.
Peak capacity
Often, large services need to be ready for flash crowds, so we can also estimate peak capacity. We assume that there’s a specific second in the day when all the requests of all the users arrive at the service simultaneously, and we use that to estimate the capacity needed for peak load. To do better, we’d need the actual distribution of requests over time, which is hard to know in advance.
By using this all-at-once second as a proxy for peak load, we’ve avoided the difficulty of finding the real distribution of requests. The full day’s worth of requests, 10 billion of them, then becomes the number of requests in that peak second. So the number of servers at peak load can be calculated as follows:
If our assumption is correct that the entire workload can show up in one specific second and each of our servers can handle 64,000 requests per second, we’ll need an astronomical 157K servers (10 billion / 64,000 ≈ 156,250)! If that’s not feasible, we have two potential paths forward, as explained below.
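Under this all-at-once assumption, the peak server count works out as follows (a quick sketch; the 10 billion figure is the total daily request count from earlier):

```python
import math

REQUESTS_PER_DAY = 10_000_000_000  # 500M DAU x 20 requests/user/day
RPS_PER_SERVER = 64_000

# Worst case: every one of the day's requests lands in the same second
peak_servers = math.ceil(REQUESTS_PER_DAY / RPS_PER_SERVER)
print(peak_servers)  # 156250, i.e., roughly 157K servers
```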
Improving the RPS of a server
First, if we think our assumption for the peak load is correct, we can work out the maximum number of servers we can commission. Let’s assume we can employ 100,000 servers at most. That implies each server must handle 10 billion / 100,000 = 100,000 requests per second. We’ll need extensive engineering to bump the RPS we can extract from a server from 64,000 to 100,000!
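The per-server requirement implied by a 100,000-server cap can be sketched in two lines (the cap is an assumed limit, as above):

```python
REQUESTS_PER_DAY = 10_000_000_000   # arriving in one peak second, worst case
MAX_SERVERS = 100_000               # assumed commissioning limit

required_rps = REQUESTS_PER_DAY / MAX_SERVERS
print(required_rps)  # 100000.0 RPS per server, up from 64,000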
There are many examples where organizations relied on extensive engineering to improve the RPS of servers.
First example: WhatsApp reported in 2012 that it could manage 2 million concurrent TCP connections on a single server. A 2017 report mentioned that WhatsApp uses roughly 700 servers from IBM for its service, though the exact specification of those servers isn’t clear.
Second example: A research system that was extensively optimized for IO won multiple sorting competitions in 2010. It sorted one trillion data records in 172 minutes using just a quarter of the computing resources of the other record holder, which amounts to roughly a threefold throughput improvement per server.
The examples above highlight that improving RPS for specific use cases is possible, though at the expense of focused R&D efforts and related dollar expenses.
Improving over the peak load assumption
The second choice is to revisit our assumption for the peak load. Using the Pareto principle, also known as the 80/20 rule, to estimate peak traffic can be a reasonable approach in many cases. The Pareto principle suggests that approximately 80 percent of the daily traffic (8 billion requests) arrives during 20 percent of the day, a window of about 4.8 hours.
Once again, we’ve assumed that requests are equally distributed within the 4.8-hour window. The examples above show that it makes a huge difference whether requests show up concurrently or spread out over time. When systems are built on such assumptions, monitoring systems are put in place to make sure the assumptions are never violated. If the load gets higher than we predicted, techniques like load shedding, circuit breakers, and throttling can be employed. Dealing with an unexpected traffic peak is a difficult problem.
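Under the 80/20 assumption, the server estimate can be sketched like this (illustrative Python; 80% of the daily requests spread evenly over a 4.8-hour window):

```python
import math

REQUESTS_PER_DAY = 10_000_000_000
RPS_PER_SERVER = 64_000

busy_share = 0.80                  # 80% of the traffic...
busy_window_s = 0.20 * 24 * 3600   # ...in 20% of the day = 17,280 seconds

peak_rps = busy_share * REQUESTS_PER_DAY / busy_window_s   # ~463K RPS
servers = math.ceil(peak_rps / RPS_PER_SERVER)
print(round(peak_rps), servers)  # 462963 8
```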
Let’s consider a service hosting the dynamic, personalized website of a large news organization. Due to some unexpected event, such as 9/11, flash crowds come to the website looking for updates. It might be a situation where all the DAUs arrive simultaneously.
Such a situation will clearly break our usual load assumptions. Can you think of some way to gracefully degrade the service to meet such an unexpected load?
Cost of servers
We picked an EC2 instance type called m7i.16xlarge with a 64-core processor and 256 GB of RAM to get a handle on the dollar cost of servers. It’s powered by 4th-Generation Intel Xeon Scalable processors. The hourly cost of one such instance is $3.54816 with a 1-year contract plan.
We’ve taken an EC2 instance from AWS with the following specifications:
EC2 Instance Specifications
| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|
| m7i.16xlarge | 64 | 256 | EBS-Only | 25 | 20 |
The following table shows the cost of m7i.16xlarge instances for two, eight, and 157K servers. The cost can quickly pile up, as we can see for the peak load case. In real projects, the dollar budget available for specific items, such as servers, is a hard constraint that the engineering team needs to meet.
Cost of servers
| Lower-bound cost per hour | Cost under the 80/20 assumption per hour | Peak-load cost per hour |
|---|---|---|
| 2 × $3.54816 ≈ $7.10 | 8 × $3.54816 ≈ $28.39 | 157K × $3.54816 ≈ $557,061 |
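The hourly bills can be checked with a short sketch (using the $3.54816 hourly price quoted above; the server counts are the three scenarios we estimated):

```python
HOURLY_COST_USD = 3.54816  # m7i.16xlarge, 1-year contract plan

scenarios = {"lower bound": 2, "80/20 peak": 8, "all-at-once peak": 157_000}
hourly_bill = {name: n * HOURLY_COST_USD for name, n in scenarios.items()}

for name, cost in hourly_bill.items():
    print(f"{name}: ${cost:,.2f}/hour")
```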
Storage requirements
In this section, we attempt to understand how storage estimation is done by using Twitter as an example. We estimate the amount of storage space required by Twitter for new tweets in a year. Let’s make the following assumptions to begin with:
We have a total of 500M daily active users.
Each user posts three tweets in a day.
Ten percent of the tweets contain images, whereas five percent of the tweets contain a video. Any tweet containing a video won’t contain an image, and vice versa.
An image is 200 KB and a video is 3 MB in size, on average.
The tweet text and its metadata require a total of 250 bytes of storage in the database. Historically, one tweet was 140 characters, but the limit has since been raised to 280 characters. We assume 250 bytes for simplicity.
Then, the following storage space will be required per day:
Estimating Storage Requirements
| Quantity | Value | Unit |
|---|---|---|
| Daily active users (DAU) | 500 | M |
| Tweets / user / day | 3 | |
| Total tweets / day | 1,500 | M |
| Storage required per tweet | 250 | B |
| Storage required per image | 200 | KB |
| Storage required per video | 3 | MB |
| Storage for tweets | 375 | GB |
| Storage for images | 30 | TB |
| Storage for videos | 225 | TB |
| Total storage | ~255 | TB |
Total storage required for one day = 375 GB + 30 TB + 225 TB ≈ 255 TB. Storage required for one year = 255 TB × 365 ≈ 93 PB.
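The storage arithmetic above can be sketched as follows (decimal units, i.e., 1 TB = 10^12 bytes, matching the estimates in this lesson):

```python
DAU = 500_000_000
TWEETS_PER_USER = 3
tweets_per_day = DAU * TWEETS_PER_USER        # 1.5 billion tweets/day

TWEET_BYTES = 250
IMAGE_BYTES = 200 * 10**3                     # 200 KB
VIDEO_BYTES = 3 * 10**6                       # 3 MB

text_storage = tweets_per_day * TWEET_BYTES             # 375 GB
image_storage = 0.10 * tweets_per_day * IMAGE_BYTES     # 10% carry an image: 30 TB
video_storage = 0.05 * tweets_per_day * VIDEO_BYTES     # 5% carry a video: 225 TB

daily_tb = (text_storage + image_storage + video_storage) / 10**12
yearly_pb = daily_tb * 365 / 1000
print(f"{daily_tb:.1f} TB/day, {yearly_pb:.0f} PB/year")
```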
Bandwidth requirements
In order to estimate the bandwidth requirements for a service, we use the following steps:
Estimate the daily amount of incoming data to the service.
Estimate the daily amount of outgoing data from the service.
Estimate the bandwidth in Gbps (Gigabits per second) by dividing the incoming and outgoing data by the number of seconds in a day.
Incoming traffic: Let’s continue from our previous example of Twitter, which ingests about 255 TB of new content each day. Therefore, the incoming traffic should support the following bandwidth: 255 TB × 8 bits per byte / 86,400 seconds ≈ 24 Gbps.
Note: We multiply by 8 in order to convert bytes (B) into bits (b) because bandwidth is measured in bits per second.
Outgoing traffic: Assume that a single user views 50 tweets in a day. Applying the same ratios of 10 percent for images and five percent for videos, out of those 50 tweets, five will contain an image and 2.5 will contain video content. Considering that there are 500M daily active users, we come to the following estimations:
Estimating Bandwidth Requirements
| Quantity | Value | Unit |
|---|---|---|
| Daily active users (DAU) | 500 | M |
| Tweets viewed / user / day | 50 | |
| Tweets viewed / second | ~289 | K |
| Bandwidth required for tweet text | ~0.58 | Gbps |
| Bandwidth required for images | ~46.24 | Gbps |
| Bandwidth required for videos | ~346.8 | Gbps |
| Total outgoing bandwidth | ~393.62 | Gbps |
Twitter will need approximately 24 Gbps of incoming traffic and approximately 394 Gbps of outgoing traffic, assuming that the uploaded content is not compressed. Total bandwidth requirements ≈ 24 + 394 ≈ 418 Gbps.
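Putting the bandwidth pieces together (a sketch under the assumptions above; the table rounds intermediate values slightly differently, e.g., it uses a flat 289K views per second):

```python
SECONDS_PER_DAY = 86_400
BITS_PER_BYTE = 8

# Incoming: ~255 TB of new content ingested per day, spread over the day
incoming_gbps = 255.375e12 * BITS_PER_BYTE / SECONDS_PER_DAY / 1e9   # ~23.6

# Outgoing: 500M users each viewing 50 tweets per day
views_per_second = 500_000_000 * 50 / SECONDS_PER_DAY    # ~289,352

text_gbps = views_per_second * 250 * BITS_PER_BYTE / 1e9            # ~0.58
image_gbps = 0.10 * views_per_second * 200e3 * BITS_PER_BYTE / 1e9  # ~46.3
video_gbps = 0.05 * views_per_second * 3e6 * BITS_PER_BYTE / 1e9    # ~347.2

outgoing_gbps = text_gbps + image_gbps + video_gbps                 # ~394
total_gbps = incoming_gbps + outgoing_gbps                          # ~418
print(f"in {incoming_gbps:.1f} Gbps, out {outgoing_gbps:.1f} Gbps, "
      f"total {total_gbps:.0f} Gbps")
```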
Once again, the calculations above depend on our specific assumption about the traffic mix (text versus images versus video) and read/write mix.
We came up with the number of 93 PB for storage needs per year. Is this number plausible?
This lesson is a template for the resource estimations in the rest of the course, and we’ll use its numbers throughout. BOTECs enable us to show a system’s feasibility under a specific design. During interviews, the ability to do such calculations exhibits a candidate’s problem-solving skills in the face of the unknown. There’s only so much we can do in an interview; in the real world, organizations rely on real measurements of their specific workloads as input to their back-of-the-envelope calculations.