We label a service a distributed system if multiple computing nodes are involved. Each node is physically distinct, does not share any computing resources, and can communicate only via network messages. Google Search, Netflix, and Amazon are examples of distributed systems.
Let's see some of the primary reasons we need a distributed system.
Making large-scale services possible: Many tasks are beyond the capacity of any single computer. For example, crawling the World Wide Web periodically and indexing this information with respect to keyword searches by billions of concurrent users is a task that involves thousands of servers.
Note: Grace Hopper, a mathematician and pioneering computer scientist who was one of the first programmers of the Harvard Mark I computer, had great farsightedness, as is evident from one of her famous quotes from the 1970s: “In pioneer days, they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers but for more systems of computers.”
Some argue that a single, powerful modern server can handle the load that once required many machines, reducing the need for a distributed system. There are, however, several counterpoints to this view.
Due to the distance between the service and its dispersed client base, clients far from a central service will incur longer latency. Replicating the service near the customers requires a distributed system.
Such arguments also tend not to adequately discuss failure cases, such as the outage of a full data center and the replacement of a failed central server.
Multiple cores in a server and NUMA-based memory access mean that a single server these days has many traits in common with a distributed system.
Our takeaway from this discussion is that while modern servers can help reduce the overall fleet size (hence the cost benefits), we can't entirely do away with the need for a distributed system.
Cost efficiency: As we try to make computational equipment bigger and more capable, the associated dollar cost does not increase linearly; it often grows super-linearly, making such an endeavor prohibitively costly. For example, we might wish to have a single network switch that enables high-bandwidth, any-to-any communication among the hundreds of thousands of servers residing in a data center. No such switch exists; instead, we use many smaller switches arranged in a specialized topology (such as a Clos or fat-tree network) to approximate one.
Fault tolerance: A computer can fail due to a myriad of hardware or software glitches. A distributed system enables a service to hide most of the component level (or even data center level) failures from the end customers, giving the operators the necessary time to repair the failed components. For example, our data might be geo-replicated at three places. Losing any one of them will trigger the service to reroute our read/write requests to a secondary copy with little to no hiccups for us. Over time, the service will either restore the failed copy, or a new server might be assigned to take over the failed server.
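As a rough illustration of this rerouting, here is a minimal sketch of a client that reads from an ordered list of replicas and falls back to the next copy when one is unreachable. The Replica interface, its read() method, and the exception type are hypothetical placeholders, not any real storage API.

import java.util.List;

public class ReplicatedReader {
    private final List<Replica> replicas; // e.g., three geo-replicated copies, primary first

    public ReplicatedReader(List<Replica> replicas) {
        this.replicas = replicas;
    }

    public byte[] read(String key) {
        for (Replica replica : replicas) {
            try {
                return replica.read(key);   // succeeds if this copy is healthy
            } catch (ReplicaUnavailableException e) {
                // This copy is down or unreachable; silently try the next one.
            }
        }
        throw new IllegalStateException("All replicas are currently unavailable");
    }

    interface Replica {
        byte[] read(String key) throws ReplicaUnavailableException;
    }

    static class ReplicaUnavailableException extends Exception {}
}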
Note: You can dive into understanding how data replication works in a fault-tolerant manner by studying Google's GFS file system design.
Reduced latency for the customers: If a service is only hosted, say, on the West Coast of the United States, the customers on the East Coast will need to deal with larger delays, and the customers in Europe and Asia might suffer much longer delays. Physical limitations, such as the speed of light, make it impossible to communicate any faster. Therefore, it becomes necessary to bring the service near the customers. Common examples are Netflix and other media that are time-sensitive and are often served from near the customer via content delivery networks.
The following illustration shows how major cloud computing players distribute their data centers and connect them via specialized networks to provide low-latency services to their customers.
Hopefully, we are convinced by now that modern services need distributed computational infrastructure to fulfill their goals. And now that we have opened this Pandora's box, let's see what challenges await us.
We’ll discuss some of the most difficult aspects of distributed systems:
Partial failures are common.
The network can lose, delay, duplicate, or reorder messages.
Failures can be hard to detect.
Node clocks are unreliable.
A single server/computer is designed so that when an unexpected fault happens, the system declares itself failed (instead of producing wrong results). Examples are a kernel panic in the Linux operating system and the "blue screen of death" in the Windows operating system: on unexpected faults, such as memory corruption, these systems halt instead of working unreliably and possibly producing wrong results. Therefore, such a system mostly works in a binary fashion: either it is working correctly, or it has failed and stopped working. This binary behavior makes such systems easier to deal with.
On the contrary, in a distributed system of many nodes, one or some nodes might have failed while the rest are working. Therefore we don't have that binary state (fully working or failed and stopped); instead, partial failures are possible. Such partial failures add non-determinism to the system. In fact, in a large distributed system, it is often assumed that something is always broken. Naturally, detecting and managing such failures is necessary.
Note: Large supercomputers, such as Frontier, solve many scientific problems, and often, on some fault, the whole system is halted, the fault is corrected, and the system resumes by loading from the last checkpoint. However, a large distributed service such as Google Search can't operate like that.
Transient network problems are also common. Such problems include brief periods of packet loss, congestion that delays messages, and short-lived network partitions in which one part of the network cannot reach another.
Networks are often modeled as synchronous or asynchronous. A synchronous network is one where message delivery has a known upper bound, and we can figure out which messages were lost. On the contrary, in an asynchronous network, a message can take arbitrarily long to reach its recipient. This means in an asynchronous model, timeouts are not a good detector of lost messages.
A real network, such as the Internet, is neither purely synchronous nor purely asynchronous. Most of the time, the Internet remains in a good state and behaves like a synchronous network, but under heavy congestion it can behave more like an asynchronous one. That partial synchrony is helpful, but we still need to deal with occasional asynchrony: protocols should be able to detect and manage the periods when the network behaves asynchronously.
A typical network delay has many components, such as transmission, queuing, and propagation delays. The queuing delays are usually responsible for the high variability (jitter) in network latency. The Internet is often dubbed a network of queues, where a packet gets through the queue at one router only to get queued at the next. Like a busy traffic intersection during rush hour, these queues can build up and add substantial delay to packets.
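As a back-of-the-envelope illustration of these components, the short sketch below computes transmission and propagation delay for assumed, illustrative numbers (a 1,500-byte packet, a 1 Gbps link, a 1,000 km fiber path); queuing delay is the part that cannot be computed ahead of time.

public class DelayEstimate {
    public static void main(String[] args) {
        double packetBits = 1500 * 8;          // a 1,500-byte packet
        double linkBandwidthBps = 1e9;         // 1 Gbps link
        double distanceMeters = 1_000_000;     // a 1,000 km path
        double propagationSpeed = 2e8;         // roughly 2/3 the speed of light in fiber

        double transmissionMs = packetBits / linkBandwidthBps * 1000;     // ~0.012 ms
        double propagationMs = distanceMeters / propagationSpeed * 1000;  // ~5 ms
        double queuingMs = 0.0;                // varies: near zero when idle,
                                               // tens of milliseconds under congestion

        System.out.printf("transmission=%.3f ms, propagation=%.3f ms, queuing=%.1f+ ms%n",
                transmissionMs, propagationMs, queuingMs);
    }
}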
Packets can be dropped at a queue if there is no space to queue them. Senders might send multiple copies of the same data if they don't receive acknowledgment of a previous message in a reasonable time. It might be the case that an earlier copy is sitting in some queue, waiting for its turn, and ultimately more than one copy of data reaches the recipient.
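One common way to cope on the receiving end is to make processing idempotent, for example by tagging each request with a unique ID and ignoring IDs that have already been seen. The sketch below assumes such a requestId is attached by the sender; the names are illustrative, not from any particular library.

import java.util.HashSet;
import java.util.Set;

public class DedupingReceiver {
    private final Set<String> seenRequestIds = new HashSet<>();

    // Process a request at most once, even if the network delivers it twice.
    public synchronized void onRequest(String requestId, byte[] payload) {
        if (!seenRequestIds.add(requestId)) {
            return; // duplicate delivery: the effect has already been applied
        }
        apply(payload);
    }

    private void apply(byte[] payload) {
        // Application-specific processing goes here.
    }
}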
It is often hard to detect a node failure. One reason is message loss over the network: not getting a response from a node is not a sure signal that the node has failed. The message might have been dropped midway, or the acknowledgment might not have made it back to the sender. This is the classic Two Generals' problem.
Note: Timeouts are not a reliable detector of node failure in an asynchronous system.
A node pausing at unexpected times can also make programming challenging. As an example, see the following code snippet, where a node acquires a lease and ensures that it has sufficient time left on the current lease (line 6). If the node pauses for, say, 20 seconds just before executing line 11, then on resumption the code assumes that it still holds a valid lease, while it does not. This scenario is a lot like concurrent code, where we can't rely on timing. (We will see a possible solution to this situation a little later in this blog.)
// Request processing loop
while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}
The following illustration pictorially shows the challenges due to unexpected node pauses. Such pauses are not hypothetical: they have been observed in production systems, for example, due to long garbage-collection pauses, a virtual machine being suspended and resumed, or the operating system paging memory to disk.
We need some notion of time for many tasks, such as detecting timeout on a network message so that it may be retransmitted, logging the amount of time a client spends on different pages of our website, and tracking when a DNS cache entry will expire and will need to be ejected from the local cache. Unfortunately, it is impossible to perfectly synchronize clocks on different nodes: despite using specific protocols such as NTP, many challenges persist.
Note: Time-of-day clocks and monotonic clocks are different and are used for different purposes. Using time-of-day clocks instead of monotonic clocks can lead to serious errors. Monotonic clocks are designed in such a way that they can't go back in time. However, if need be, they can be made a bit faster or slower to adjust the skew in clock value.
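In Java, for example, System.currentTimeMillis() reads the time-of-day clock, while System.nanoTime() provides a monotonic time source intended for measuring elapsed time. The short sketch below shows why elapsed time should be measured with the latter.

public class ElapsedTime {
    public static void main(String[] args) throws InterruptedException {
        long startWall = System.currentTimeMillis(); // time-of-day: may jump forward or backward
        long startMono = System.nanoTime();          // monotonic: never goes back

        Thread.sleep(100); // stand-in for some work

        long wallElapsedMs = System.currentTimeMillis() - startWall;      // can even be negative
        long monoElapsedMs = (System.nanoTime() - startMono) / 1_000_000; // always non-negative

        System.out.println("wall clock: " + wallElapsedMs + " ms, monotonic: " + monoElapsedMs + " ms");
    }
}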
We will see two examples where unsynchronized clocks can cause damage.
The first example is from Cloudflare. In the following code snippet (at line 7), the code computes the duration between now and an earlier point in time, start. At that specific moment, however, the clock had been set backward, so now - start yielded a negative value. The code assumed that rtt would always be positive, and unexpectedly getting a negative value had a cascading impact, failing their service for many of their customers. The details of this problem are described in Cloudflare's postmortem, linked in the snippet below.
Note: We often assume that if two processes are on two different nodes, their failures will be independent—meaning that the failure of one node won’t impact the other processes. However, as illustrated by the Cloudflare example where the same bug impacted many of their nodes in a data center, that is not always the case: a bug in one node may impact multiple nodes.
// Following code is taken from the Cloudflare blog
// where they highlight how an unexpected negative
// value impacted their service in 2017
// See for details:
// https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
if !start.IsZero() {
    rtt := time.Now().Sub(start)
    if success && rcode != dns.RcodeServerFailure {
        s.updateRTT(rtt)
    } else {
        s.updateRTT(TimeoutPenalty * s.timeout)
    }
}
Our second example is about a conflict-resolution mechanism used in replicated data storage known as last write wins (LWW). If nodes use wall-clock time (which we are unable to synchronize perfectly), even clock skew at the millisecond (ms) level can produce data consistency issues.
In the following illustration, it is clear that although x=2 happened after x=1, node 2 sets x=1 as the final value because the timestamp attached to x=1 is larger than the timestamp attached to x=2, a result of the skew between the two nodes' clocks.
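The same pitfall can be sketched in a few lines of code. The LwwRegister below is a minimal, hypothetical last-write-wins register: whichever write carries the larger wall-clock timestamp wins, so a writer whose clock lags by a few milliseconds can have its causally newer write silently discarded.

public class LwwRegister {
    private long latestTimestampMillis = Long.MIN_VALUE;
    private int value;

    // timestampMillis comes from the writer's wall clock.
    public synchronized void write(int newValue, long timestampMillis) {
        if (timestampMillis > latestTimestampMillis) {
            latestTimestampMillis = timestampMillis;
            value = newValue;
        }
        // Otherwise the write is dropped, even if it is causally newer.
    }

    public synchronized int read() {
        return value;
    }

    public static void main(String[] args) {
        LwwRegister register = new LwwRegister();
        register.write(1, 10_005); // node 1's clock runs fast: x = 1 stamped at t = 10,005
        register.write(2, 10_001); // node 2 writes x = 2 later, but its clock says t = 10,001
        System.out.println(register.read()); // prints 1: the causally older write won
    }
}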
By now, we should be convinced that strange, unexpected things can happen in distributed settings that make designing software for distributed systems a challenging task. In the following section, we will see how researchers and system designers deal with the issues.
In this section, we survey some of the ways the above-mentioned challenges are managed. Researchers have devised rigorous formalisms to thoroughly understand the problem and to provide possible solutions. Let’s now discuss system models, failure models, and the correctness properties of systems.
Note: Challenging problems, especially those whose solution has substantial rewards, attract human curiosity and intellect. Over the last few decades, researchers and practitioners have worked hard to make today’s household services like Google, Facebook, Netflix, etc., possible. You can learn how YouTube works here.
With all the problems we explained above, a designer must have a strategy to determine the correctness of the system. We use safety and liveness conditions for this purpose. Safety means that nothing bad ever happens, and liveness means that something good eventually happens. If you want a more formal definition of safety and liveness, see the paper that originally formalized these properties.
Examples of using the safety and liveness properties are consensus algorithms, such as Paxos and Raft. The safety condition could be that a minority never changes a value chosen by a majority. The liveness condition can be that the consensus makes progress as long as a simple majority of participating nodes are available.
There is often a tradeoff between safety and liveness properties. Another reason to analyze the state of the system through two separate lenses, safety and liveness, is that for many use cases, it is acceptable to sacrifice liveness (at least for a while) as long as we never compromise on safety. In the context of consensus algorithms, the Paxos and Raft systems are always consistent (a safety condition) but might not make progress (a liveness condition) when a majority is unavailable.
It turns out that the execution environment in which a distributed system operates has a major impact on what can (and cannot) be done. Researchers have devised two kinds of computational models—synchronous and asynchronous.
A synchronous model is one where:
Nodes take a known bounded time to process any requests.
Network messages take a known bounded time to reach their recipients.
The clocks on different nodes have a known bound on the skew that is never violated.
This is a good model where everyone behaves nicely.
An asynchronous model is one where:
Nodes can take arbitrarily long to process requests.
Messages can take arbitrarily long to be delivered over the network.
The clock skew is also not bounded.
This is a harsh model of a computational environment with multiple sources of uncertainty and incomplete information.
The question is how closely our real computational environment matches these theoretical models. The reality is neither as pretty as in the synchronous model nor probably as bad as in the asynchronous model. In some sense, synchronous and asynchronous models are two ends of a spectrum. They help us prove exciting things such as:
X is impossible in a synchronous model.
Y can be done in an asynchronous model.
Implicit here is the notion that if X is impossible in a nice synchronous model, it is definitely impossible for anything farther along the spectrum, and if something is possible in the tough asynchronous model, it is definitely possible in a model with less uncertainty.
Once again, we define a spectrum, this time of failures: the easiest to manage are called crash-stop failures, while the hardest to manage are called Byzantine failures. These failure models serve a similar purpose as the computational models: they let us prove properties about systems.
Crash-stop failures are those where, when a fault occurs, the affected node stops working and does not participate in any further processing. Byzantine failures are arbitrary failures, where participants can behave in unexpected (possibly malicious) ways.
Note: We recently added a new section on consensus protocols in our course on Grokking the Principles and Practices of Advanced System Design. You may read that section to see safety, liveness, and computational and fault models in action.
TCP can hide message losses, message reordering, and message duplication. However, hiding message latency is not always possible. System designers go to great lengths to reduce latency for their customers, for example, by serving content from locations close to the customers (as content delivery networks do) and by designing protocols that minimize the number of network round trips.
Because a misbehaving network is at the root of many challenges, hyperscalers such as Google, Amazon, and Microsoft invest heavily in dedicated, well-provisioned private networks that interconnect their data centers, reducing the chances of network misbehavior in the first place.
Modern services rely on extensive monitoring to deal with failed nodes. Still, it is possible that monitoring declares a node failed when it is not. To deal with such issues, systems are built in such a way that safety conditions are always met (for example, in the context of consensus, there is never more than one leader). Similarly, implementing safeguards such as fencing tokens (discussed later in this blog) protects the system from nodes that were wrongly declared dead and later come back.
We saw that synchronizing physical clocks on different nodes using network protocols like NTP is challenging. Leslie Lamport invented logical clocks that can be implemented using simple counters and suggested tagging each network message with these counters. A desirable property of such clocks is that they should support causality. For example, if Event A happens before Event B at a single node, the local logical clock value should be assigned such that the clock value of Event B is larger than the clock value of Event A. When a node sends a message to another node, the clock value at reception should be higher than the one when the message was sent.
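A minimal Lamport clock can be sketched as follows; the class and method names are illustrative. The key rule is in onReceive(): the receiver's clock jumps past the timestamp carried by the message, so the receive event always gets a larger value than the corresponding send event.

import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // A local event happened: advance the clock.
    public long tick() {
        return counter.incrementAndGet();
    }

    // Stamp an outgoing message with a fresh clock value.
    public long onSend() {
        return counter.incrementAndGet();
    }

    // On receiving a message, jump ahead of the sender's timestamp so that
    // the receive event is ordered after the send event (causality preserved).
    public long onReceive(long messageTimestamp) {
        return counter.updateAndGet(current -> Math.max(current, messageTimestamp) + 1);
    }
}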
Note: See our chapter on sequencers to know how to generate monotonically increasing, causality bearing counters.
Let's see how we can solve our earlier LWW issue with unsynchronized clock values. The following illustration shows that sending a monotonically increasing fencing token in each request helps us avoid the pitfalls that we saw earlier. When a node receives an old token with a request, the node rejects such requests.
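A sketch of the check on the storage side might look like the following; the class is hypothetical and only illustrates the rule that a request carrying a token smaller than the highest token already seen is rejected.

public class FencedStore {
    private long highestTokenSeen = -1;
    private int value;

    public synchronized void write(long fencingToken, int newValue) {
        if (fencingToken < highestTokenSeen) {
            // The writer holds an expired lease (e.g., it paused and another
            // client was granted a newer lease in the meantime): reject it.
            throw new IllegalStateException("Stale fencing token: " + fencingToken);
        }
        highestTokenSeen = fencingToken;
        value = newValue;
    }

    public synchronized int read() {
        return value;
    }
}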
The next innovation came with Google's Spanner, which introduced the TrueTime API. The key observation was that while the clocks on the nodes of a distributed system cannot be perfectly synchronized, the uncertainty in the clock value can be explicitly exposed to applications, while working hard to keep that uncertainty small.
Spanner uses atomic and GPS clocks to keep the clock uncertainty around 6 ms. When the TrueTime API gives a clock value, it also reports the current uncertainty bound.
Time with controlled uncertainty might not seem like much on the surface, but it is a significant feat. Spanner's ability to be a strongly consistent database at global scale owes quite a bit to the TrueTime API.
The following illustration shows that a timestamp obtained from the TrueTime API is not a single value but an interval that is guaranteed to contain the true current time.
The following illustration shows that the uncertainty in the clock increases over time, but the Spanner system strives hard to keep it below 6 ms on average. The slanted lines in the following illustration show a gradually increasing skew in the clock value. Occasionally, the uncertainty can go beyond this average value, and Spanner and its applications have special protocols to deal with such situations (for example, waiting out the uncertainty when they want to be sure that a time point has definitely passed.)
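The "waiting out the uncertainty" idea can be sketched as follows. TimeSource here is a hypothetical stand-in for a TrueTime-like API that reports an interval guaranteed to contain the true current time; we are certain a timestamp is in the past only once even the earliest possible current time has moved beyond it.

public class CommitWait {
    interface TimeSource {
        long earliestMillis(); // no earlier than the true current time
        long latestMillis();   // no later than the true current time
    }

    // Block until we are certain the given timestamp is in the past, i.e.,
    // even the earliest possible current time has moved beyond it.
    static void waitUntilPast(TimeSource time, long timestampMillis) throws InterruptedException {
        while (time.earliestMillis() <= timestampMillis) {
            Thread.sleep(1); // the uncertainty interval is only a few milliseconds
        }
    }
}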
Distributed systems are fundamentally different from writing software on a single computer. A distributed system can fail in a myriad of new, unexpected ways. A service built on a distributed system aims to shield its customers from all possible failures and yet serve its purpose well. In this blog, we surveyed some of the primary challenges in building distributed services and how we can manage those challenges. However, we were only able to scratch the surface, and our purpose was to stimulate your interest. To go deeper into these topics, we recommend our following courses: Distributed Systems for Practitioners and Grokking the Principles and Practices of Advanced System Design.