Put Back-of-the-Envelope Numbers in Perspective
Using realistic numbers in back-of-the-envelope calculations.
A distributed system consists of compute nodes connected via a network. There is a wide variety of compute nodes available, and they can be connected in many different ways. Back-of-the-envelope calculations let us ignore the nitty-gritty details of the system (at least at the design level) and focus on its more important aspects.
Examples of back-of-the-envelope calculations include:
- Number of concurrent TCP connections a server can support
- Latency of a cross-continental link connecting two data centers
Choosing an unreasonable number for such calculations can irk the interviewer and send the wrong message that we are not aware of how real systems perform.
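As a quick illustration, here is a minimal Python sketch of the first estimate above. The per-connection memory figure (~10 KB) and the 64 GB of RAM are assumptions chosen for the example, not numbers from this lesson:

```python
# Back-of-the-envelope: how many concurrent TCP connections can one server hold?
# Assumed figures (not from the lesson): ~10 KB of kernel and buffer memory per
# idle connection, and a server with 64 GB of RAM available for connection state.

RAM_BYTES = 64 * 2**30           # 64 GB available for connection state (assumption)
MEM_PER_CONN_BYTES = 10 * 2**10  # ~10 KB per idle TCP connection (assumption)

max_connections = RAM_BYTES // MEM_PER_CONN_BYTES
print(f"Rough upper bound: {max_connections:,} concurrent connections")
# -> Rough upper bound: 6,710,886 concurrent connections
```

A few million connections is a memory-only upper bound; in practice, file-descriptor limits and CPU cost per connection bring the realistic number down, but the estimate keeps us in the right order of magnitude.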
Overview of the system
Let’s consider the example of an online classifieds platform similar to OLX, the world’s leading classifieds platform in growth markets, available in more than 40 countries and over 50 languages.
Our platform will allow people to buy, sell, or exchange a wide variety of used goods and services by enabling them to post a listing from their mobile phone or on the web quickly and conveniently.
Server requirements
Before we dig deep into the server requirements and assess different design options, let’s first look at some numbers to get an idea of the time taken by some of the routine computer operations.
The following tables show typical latency and throughput numbers, as well as server capabilities, to guide our back-of-the-envelope calculations.
Note: Some of these numbers may be a little outdated given the advancements in modern-day computers. However, they are enough to give us an idea of the time taken by routine computer operations.
Latency Numbers
| Component | Time (nanoseconds) | Comment |
| --- | --- | --- |
| L1 cache reference | 0.5 | |
| Branch mispredict | 5 | |
| L2 cache reference | 7 | 14x L1 cache |
| Mutex lock/unlock | 25 | |
| Main memory reference | 100 | 20x L2 cache, 200x L1 cache |
| Compress 1K bytes with Snzip | 3,000 (3 microseconds) | |
| Send 1K bytes over a 1 Gbps network | 10,000 (10 microseconds) | |
| Read 4K randomly from an SSD (~1 GB/sec) | 150,000 (150 microseconds) | |
| Read 1 MB sequentially from memory | 250,000 (250 microseconds) | |
| Round trip within the same datacenter | 500,000 (500 microseconds) | |
| Read 1 MB sequentially from an SSD (~1 GB/sec) | 1,000,000 (1 millisecond) | 4x memory |
| Disk seek | 10,000,000 (10 milliseconds) | 20x datacenter round trip |
| Read 1 MB sequentially from disk | 20,000,000 (20 milliseconds) | 80x memory, 20x SSD |
| Send packet CA->Netherlands->CA | 150,000,000 (150 milliseconds) | |
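To see how these figures translate into estimates, the following sketch uses the per-MB numbers from the table to approximate the time to read 1 GB sequentially from each medium (taking 1 GB ≈ 1,000 MB for simplicity):

```python
# Turning the latency table into estimates: time to read 1 GB sequentially
# from memory, SSD, and spinning disk, using the per-MB figures above.

NS_PER_MB = {
    "memory": 250_000,    # 250 microseconds per MB
    "ssd": 1_000_000,     # 1 millisecond per MB
    "disk": 20_000_000,   # 20 milliseconds per MB
}

GB_IN_MB = 1_000  # 1 GB ~= 1,000 MB for a rough estimate

for medium, ns_per_mb in NS_PER_MB.items():
    seconds = ns_per_mb * GB_IN_MB / 1e9
    print(f"Read 1 GB sequentially from {medium}: ~{seconds:.2f} s")
# memory: ~0.25 s, ssd: ~1.00 s, disk: ~20.00 s
```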
Throughput Numbers
| Link | Throughput (Mbps) | Comment |
| --- | --- | --- |
| Across virtual machines on the same physical server | 0.5 | |
| Out of the local server | 5 | |
| Between two systems on the same rack | 7 | |
| Between two systems on different racks | 25 | |
| Typical cross-sectional bandwidth | 10,000 (10 Gbps) | 40 or 100 Gbps is becoming common; Google's Jupiter fabric-based network can support 1 petabit/sec of bisection bandwidth; nearby datacenters usually have higher bandwidth than long-haul links |
| Typical aggregate bandwidth between two datacenters | 10,000 (10 Gbps) and onwards | |
| SSD read/write | | |
| Disk read/write | | |
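As one way to apply these numbers, the sketch below estimates how long it would take to replicate data between two datacenters over the typical 10 Gbps aggregate link from the table; the 1 TB payload is an assumed figure for illustration:

```python
# Using the throughput table: how long to move 1 TB of data between two
# datacenters over a typical 10 Gbps aggregate link?

LINK_GBPS = 10               # from the throughput table
DATA_TB = 1                  # assumed payload for illustration
BITS = DATA_TB * 8 * 10**12  # 1 TB = 8 * 10^12 bits (decimal units)

seconds = BITS / (LINK_GBPS * 10**9)
print(f"~{seconds / 60:.0f} minutes to move {DATA_TB} TB at {LINK_GBPS} Gbps")
# -> ~13 minutes to move 1 TB at 10 Gbps
```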
Server Capabilities
| Component | Count | Comment |
| --- | --- | --- |
| Number of processors | 2 | |
| Number of cores | 160 | |
| RAM | Up to 8 TB | |
| Number of disks | Up to 24 | |
| Network interfaces | LOM: OCP 3.0 NIC, PCIe x16 Gen4; Management: 1x 1 GbE from BMC to rear RJ45 connector | Lights-out management (LOM) is a form of out-of-band management that enables system administrators to monitor and maintain servers and other data center infrastructure by connecting to them remotely. |
No. of servers required
Let’s assume we have a total of 350M active users and that a single user makes 20 requests per day on average.
- This gives us a total of 350M * 20 = 7B requests per day, or approximately 81K requests per second, which is the queries per second (QPS) for our system.
- Assuming an average request takes 50 milliseconds of compute time on a single core, each core can process 1000 / 50 = 20 requests per second. Since our server has 160 cores, it can process 160 * 20 = 3,200 requests per second.
- Putting these numbers together, we will need approximately 81K / 3,200 ≈ 26 servers to handle all requests, as the sketch below works out.
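Here is the same estimate, written out as a small calculation so the assumptions are easy to tweak:

```python
import math

# The server-count estimate above, written as a repeatable calculation.

ACTIVE_USERS = 350_000_000
REQUESTS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400

CORES_PER_SERVER = 160  # from the server capabilities table
REQUEST_CPU_MS = 50     # average compute time per request on one core

qps = ACTIVE_USERS * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY
per_core_rps = 1000 / REQUEST_CPU_MS              # 20 requests/sec per core
per_server_rps = per_core_rps * CORES_PER_SERVER  # 3,200 requests/sec

servers = math.ceil(qps / per_server_rps)
print(f"QPS: {qps:,.0f}, servers needed: ~{servers}")
# -> QPS: 81,019, servers needed: ~26
```

Note that this is a floor for compute alone; a real deployment would add headroom for peak traffic, redundancy, and failover.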
Cache
Caching refers to storing data so that future requests for that data can be served faster. The cached data may be a copy of data stored elsewhere or the result of an earlier computation.
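As an illustration of the idea, here is a minimal sketch of a least-recently-used (LRU) cache. LRU is just one common eviction policy, chosen here for the example rather than prescribed by this lesson:

```python
from collections import OrderedDict

# A minimal LRU cache sketch: keeps the most recently used entries and
# evicts the least recently used one when capacity is exceeded.

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

# Hypothetical usage with listing data from our classifieds platform:
cache = LRUCache(capacity=2)
cache.put("listing:1", {"title": "Used bike"})
cache.put("listing:2", {"title": "Sofa"})
cache.get("listing:1")                   # touch listing:1
cache.put("listing:3", {"title": "TV"})  # evicts listing:2
```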
For our system, we can use a number of ...