
The Joy and Pain of Distributed Systems

Abdul Qadeer
Sep 05, 2023
21 min read

Leslie Lamport, a renowned computer scientist, Turing Award laureate, and the mind behind foundational distributed systems concepts such as Lamport clocks and Paxos, once said: “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” That anecdotal description of a distributed system might seem tongue-in-cheek, but it has a lot of truth in it. For example, an Amazon Web Services outage in November 2020 broke people's robotic vacuums and doorbells (see “Amazon web outage breaks vacuums and doorbells”), leaving even computing professionals scratching their heads trying to figure out what a cloud provider's outage has to do with the functioning of vacuum cleaners and doorbells.

Martin Kleppmann, a research fellow at the Chair of Distributed Systems & Operating Systems at TU Munich and the author of the widely popular book Designing Data-Intensive Applications, says: "If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so." According to him, going from a single, centralized computing node to a distributed system is like opening a Pandora's box of challenges: the benefits that distributed systems bring come at a hefty price in terms of complexity.

Writing software for a distributed system vs. a single computer might be as if we are opening Pandora's box of challenges

What is a distributed system, and why do we need it?#

We label a service a distributed system if multiple computing nodes are involved. Each node is physically distinct, does not share any computing resources, and can communicate only via network messages. Google Search, Netflix, and Amazon are examples of distributed systems.

Let's see some of the primary reasons we need a distributed system.

Making large-scale services possible: Many tasks are beyond the capacity of any single computer. For example, crawling the World Wide Web periodically and indexing this information with respect to keyword searches by billions of concurrent users is a task that involves thousands of servers.

Note: Grace Hopper, a mathematician and pioneering computer scientist who was one of the first programmers of the Harvard Mark I computer, had great farsightedness, as is evident from one of her famous quotes from the 1970s: “In pioneer days, they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers but for more systems of computers.”

Grace Hopper's analogy of oxen. Just like growing a bigger ox is expensive and hard as compared to using multiple average sized oxen, building bigger and bigger computers is not feasible

Some argue that a single computer today is so capable that we should rethink whether we need distributed systems at all (MotherDuck: "The Simple Joys of Scaling Up"). It is indeed true that we have come a long way—today's smartphone is as capable as a supercomputer of the past, and at a much lower cost. Yet, driven by our insatiable desire to do more, humans attack bigger, more complex problems whenever their computational tools improve. Further, the above-quoted article seems to ignore the following aspects:

  • A centralized service is far from much of its dispersed client base, so clients located away from it incur longer latency. Replicating the service near the customers requires a distributed system.

  • The article assumes that Moore's law and Dennard scaling will continue—in fact, it ignores Dennard scaling altogether. In recent years, researchers and engineers have struggled to achieve performance gains that match the predictions of Moore's law and Dennard scaling. Additionally, having more transistors might not translate to higher performance.

  • Failure cases, such as the outage of a full data center or the replacement of a failed central server, are not adequately discussed.

  • Multiple cores in a server and NUMA-based memory access mean that a single server these days has many traits in common with a distributed system.

Our takeaway from this discussion is that while modern servers can help reduce the overall fleet size (hence the cost benefits), we can't entirely do away with the need for a distributed system.

Cost efficiency: As we try to make computational equipment bigger and more capable, the associated dollar cost does not increase linearly—it often grows super-linearly—making such an endeavor prohibitively costly. For example, we might wish for a single network switch that enables any-to-any, high-bandwidth communication amongst the hundreds of thousands of servers residing in a data center. No such switch exists; instead, we combine smaller switches in a special, cost-efficient Clos (read as “Klo”) topology to make it happen (see the seminal paper by Al-Fares et al., “A Scalable, Commodity Data Center Network Architecture,” https://dl.acm.org/doi/pdf/10.1145/1402958.1402967, on how hundreds of off-the-shelf switches are connected to build a data center network). Similar cost efficiency is at play in a modern data center (see “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Second Edition, https://research.google/pubs/pub41606/).

Commodity network switches combined to connect many computers with redundant network paths

Fault tolerance: A computer can fail due to a myriad of hardware or software glitches. A distributed system enables a service to hide most of the component level (or even data center level) failures from the end customers, giving the operators the necessary time to repair the failed components. For example, our data might be geo-replicated at three places. Losing any one of them will trigger the service to reroute our read/write requests to a secondary copy with little to no hiccups for us. Over time, the service will either restore the failed copy, or a new server might be assigned to take over the failed server.

Note: You can dive into understanding how data replication works in a fault-tolerant manner by studying Google's GFS file system design.

Reduced latency for the customers: If a service is hosted, say, only on the West Coast of the United States, customers on the East Coast will face larger delays, and customers in Europe and Asia might suffer much longer ones. Physical limitations, such as the speed of light, make it impossible to communicate any faster. Therefore, it becomes necessary to bring the service near the customers. Common examples are Netflix and other latency-sensitive media services, which are often served from near the customer via content delivery networks.

The following illustration shows how major cloud computing players distribute their data centers and connect them via specialized networks to provide low-latency services to their customers.

AWS (top left), Azure (top right), and GCP's global infrastructure

Hopefully, we are convinced by now that modern services need distributed computational infrastructure to fulfill their goals. And now that we have opened this Pandora's box, let's see what challenges await us.

Why distributed systems are hard#

We’ll discuss some of the most difficult aspects of distributed systems:

  • Partial failures are common.

  • The network can lose, delay, duplicate, or reorder messages.

  • Failures can be hard to detect.

  • Node clocks are unreliable.

Partial failures are common#

A single server/computer is designed so that when an unexpected fault happens, the system declares itself failed (instead of producing wrong results). Examples are a kernel panic in the Linux operating system and the “blue screen of death” in the Windows operating system—on unexpected faults such as memory corruption, they halt instead of working unreliably and possibly producing wrong results. Therefore, such a system mostly works in a binary fashion—either it is working correctly, or it has failed and stopped working. This binary behavior makes such systems easier to deal with.

On the contrary, in a distributed system of many nodes, one or some nodes might have failed while the rest are working. Therefore, we don't have that binary state (fully working, or failed and stopped); instead, partial failures are possible. Such partial failures add non-determinism to the system. In fact, in a large distributed system, it is often assumed that something is always broken. Naturally, detecting and managing such failures is necessary.

Note: Large supercomputers such as Frontier (HPE Cray EX235a, AMD-optimized 3rd-generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11) solve many scientific problems, and often, on a fault, the whole system is halted, the fault is corrected, and the system resumes by loading from the last checkpoint. However, a large distributed service such as Google Search can't operate like that.


Transient network problems are also common. Such problems include route flaps—where a router repeatedly advertises and withdraws routes in a short period of time—and the time required for the network routing tables to reach a stable state after any link change, such as a link failure. Additionally, it is possible that some nodes can receive incoming traffic, but their outgoing traffic does not reach anyone. In such a scenario, the node might be considered failed by everyone else, while it considers itself an active part of the group. Many consensus algorithms, such as Raft, rely on a single leader. If a node considers itself the leader while the rest of the group can't reach it and considers it failed, the situation can wreak havoc if care is not taken in the design of the consensus algorithm.

The network can lose, delay, duplicate, or reorder messages#

Networks are often modeled as synchronous or asynchronous. A synchronous network is one where message delivery has a known upper bound, and we can figure out which messages were lost. On the contrary, in an asynchronous network, a message can take arbitrarily long to reach its recipient. This means in an asynchronous model, timeouts are not a good detector of lost messages.

A real network, such as the Internet, is neither purely synchronous nor asynchronous. Instead, most of the time, the Internet remains in a good state, behaving like a synchronous network, while when congestion is high, it might behave more like an asynchronous network. That partial synchrony is helpful, but we need to deal with occasional asynchrony—protocols should be able to detect and manage when the network becomes asynchronous.
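
One common way to live with this partial synchrony is to adapt timeouts to the delays we actually observe, roughly in the spirit of TCP's retransmission timer: track a smoothed round-trip time and its variability, and time out only well beyond both. The following is a minimal Java sketch of that idea; the class name and the constants are illustrative choices, not taken from any particular system.

// A minimal sketch of an adaptive timeout: keep an exponentially weighted
// moving average of observed round-trip times plus their deviation, and set
// the timeout a few deviations above the average. When the network slows down
// or becomes more variable, the timeout grows, producing fewer false
// "node failed" verdicts.
class AdaptiveTimeout {
    private double srtt = 0;     // smoothed round-trip time (ms)
    private double rttVar = 0;   // smoothed deviation of the RTT (ms)
    private boolean first = true;

    // Feed in each measured round-trip time (in milliseconds).
    void observe(double rttMillis) {
        if (first) {
            srtt = rttMillis;
            rttVar = rttMillis / 2;
            first = false;
        } else {
            rttVar = 0.75 * rttVar + 0.25 * Math.abs(srtt - rttMillis);
            srtt = 0.875 * srtt + 0.125 * rttMillis;
        }
    }

    // Current timeout: the smoothed RTT plus a safety margin.
    long timeoutMillis() {
        return (long) Math.ceil(srtt + 4 * rttVar);
    }
}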

A typical network delay has many components, such as transmission, queuing, and propagation delays. The queuing delays are usually responsible for the high variability (jitter) in network latency. The Internet is often dubbed a network of queues, where a packet gets through a queue at one router only to get queued at the next. Like a busy traffic intersection during rush hour, these queues can build up, adding substantial delay to the packets.

When every road leads to a specific port, congestion is almost guaranteed to happen

Packets can be dropped at a queue if there is no space to queue them. Senders might send multiple copies of the same data if they don't receive acknowledgment of a previous message in a reasonable time. It might be the case that an earlier copy is sitting in some queue, waiting for its turn, and ultimately more than one copy of data reaches the recipient.

Failures can be hard to detect#

It is often hard to detect a node failure. One of the reasons is message loss over the network. Not getting a response from a node is not a sure signal that the node has failed: the request might have been dropped midway, or the acknowledgment might not have reached the sender. This is the classic Two Generals’ problem (covered in our course: https://www.educative.io/courses/grokking-the-principles-and-practices-of-advanced-system-design). Yet another reason could be that the process is unable to reply at the moment—for example, the node may be paused due to stop-the-world garbage collection, or the virtual machine may have been halted by the hypervisor.

Note: Timeouts are not a reliable detector of node failure in an asynchronous system.

The three scenarios—a lost message, an unresponsive recipient, and a lost reply—are indistinguishable to the client

A node pausing at unexpected times can make programming challenging. As an example, see the following code snippet, where a node checks that it has at least 10 seconds left on its current lease before processing a request. If the node pauses for, say, 20 seconds just before calling process(request), then on resumption it processes the request as if it still held a valid lease, while the lease may have expired during the pause. This scenario is a lot like concurrent code, where we can't rely on timing. (We will see a possible solution to this situation a little later in this blog.)

// Request processing loop
while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        // A long pause (e.g., stop-the-world GC) right here lets the lease
        // expire, yet the request below is still processed
        process(request);
    }
}

The following illustration pictorially shows the challenges due to unexpected node pauses. Such pauses are not hypothetical—they have been reported in the field (see https://dzone.com/articles/how-tame-java-gc-pauses and https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic).

Long process pauses make life harder

Node clocks are unreliable#

We need some notion of time for many tasks, such as detecting timeout on a network message so that it may be retransmitted, logging the amount of time a client spends on different pages of our website, and tracking when a DNS cache entry will expire and will need to be ejected from the local cache. Unfortunately, it is impossible to perfectly synchronize clocks on different nodes: despite using specific protocols such as NTP, many challenges persist.

Note: Time-of-day clocks and monotonic clocks are different and are used for different purposes. Using time-of-day clocks instead of monotonic clocks can lead to serious errors. Monotonic clocks are designed in such a way that they can't go back in time. However, if need be, they can be made a bit faster or slower to adjust the skew in clock value.
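
As a concrete illustration in Java, System.currentTimeMillis() follows the time-of-day clock, while System.nanoTime() is the monotonic clock. The doWork() method below is just a placeholder workload; the point is how elapsed time is measured.

// Contrasting the time-of-day clock with the monotonic clock in Java.
public class ClockDemo {
    // Placeholder for any operation we want to time.
    static void doWork() throws InterruptedException {
        Thread.sleep(100);
    }

    public static void main(String[] args) throws InterruptedException {
        // Time-of-day clock: NTP corrections (or a leap second) may step it
        // backward, so in rare cases this difference can come out negative.
        long wallStart = System.currentTimeMillis();
        doWork();
        long wallElapsed = System.currentTimeMillis() - wallStart;

        // Monotonic clock: only differences between readings are meaningful,
        // but it never jumps backward, so elapsed time is never negative.
        long monoStart = System.nanoTime();
        doWork();
        long monoElapsedMillis = (System.nanoTime() - monoStart) / 1_000_000;

        System.out.println("wall: " + wallElapsed + " ms, monotonic: " + monoElapsedMillis + " ms");
    }
}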

We will see two examples where unsynchronized clocks can cause damage.

The first example is from Cloudflare. In the following code snippet, the line computing rtt tries to find the duration between the current time and an earlier time start. However, at that specific point, the clock had been set backward, so now - start produced a negative value. The code assumed that rtt would always be positive, and unexpectedly getting a negative value had a cascading impact that failed the service for many of Cloudflare's customers. The root cause (https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/) was that the code assumed the time value could never go negative, but due to the leap-second adjustment, the time-of-day clock went backward.

Note: We often assume that if two processes are on two different nodes, their failures will be independent—meaning that the failure of one node won’t impact the other processes. However, as illustrated by the Cloudflare example where the same bug impacted many of their nodes in a data center, that is not always the case: a bug in one node may impact multiple nodes.

// Following code is taken from the Cloudflare blog
// where they highlight how an unexpected negative
// value impacted their service in 2017
// See for details:
// https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
if !start.IsZero() {
    rtt := time.Now().Sub(start)
    if success && rcode != dns.RcodeServerFailure {
        s.updateRTT(rtt)
    } else {
        s.updateRTT(TimeoutPenalty * s.timeout)
    }
}

Our second example is about a conflict-resolution strategy used in replicated data stores known as last write wins (LWW). If nodes use wall-clock time (which we are unable to synchronize perfectly), even clock skew at the millisecond (ms) level can produce data consistency issues.

In the following illustration, it is clear that although x=2 happened after x=1, node 2 sets x=1 as the final value, because the timestamp of 9.004 associated with x=1 is larger than the timestamp of 9.003 attached to x=2. As a result, we have inconsistent data for the same variable: nodes 1 and 3 have 2 as the final value of x, while node 2 has 1 as its final value.

Relying on physical, unsynchronized clocks can generate wrong results
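
The behavior in the illustration boils down to a per-key "keep the larger timestamp" rule. Here is a minimal sketch of such an LWW register store in Java; the class and method names are our own, for illustration only.

// A minimal last-write-wins store: for each key, keep only the value carrying
// the largest timestamp seen so far. With skewed wall clocks, a causally later
// write can carry a smaller timestamp and be silently discarded.
import java.util.HashMap;
import java.util.Map;

public class LwwRegisterStore {
    private static final class Entry {
        final long timestampMillis;
        final String value;
        Entry(long timestampMillis, String value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final Map<String, Entry> data = new HashMap<>();

    // Apply a replicated write tagged with the writer's wall-clock timestamp.
    void apply(String key, String value, long timestampMillis) {
        Entry current = data.get(key);
        if (current == null || timestampMillis > current.timestampMillis) {
            data.put(key, new Entry(timestampMillis, value));
        }
        // else: the incoming write "loses" and is dropped, even if it actually
        // happened later in real time.
    }

    String read(String key) {
        Entry e = data.get(key);
        return e == null ? null : e.value;
    }

    public static void main(String[] args) {
        LwwRegisterStore node2 = new LwwRegisterStore();
        // x=1 was written first, but its node's (fast) clock read 9.004;
        // x=2 was written later, but its node's (slow) clock read 9.003.
        node2.apply("x", "1", 9004);
        node2.apply("x", "2", 9003);
        System.out.println(node2.read("x"));  // prints 1: the older write wins
    }
}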

By now, we should be convinced that strange, unexpected things can happen in distributed settings that make designing software for distributed systems a challenging task. In the following section, we will see how researchers and system designers deal with the issues.

Managing challenges#

In this section, we survey some of the ways the above-mentioned challenges are managed. Researchers have devised rigorous formalisms to thoroughly understand the problem and to provide possible solutions. Let’s now discuss system models, failure models, and the correctness properties of systems.

Note: Challenging problems, especially those whose solution has substantial rewards, attract human curiosity and intellect. Over the last few decades, researchers and practitioners have worked hard to make today’s household services like Google, Facebook, Netflix, etc., possible. You can learn how YouTube works here.

Correctness of a distributed system#

With all the problems we explained above, a designer must have a strategy to determine the correctness of the system. We use safety and liveness conditions for this purpose. Safety means that nothing bad ever happens, and liveness means that something good eventually happens. For a more formal definition of safety and liveness, see Alpern, Bowen, and Fred B. Schneider. 1985. “Defining Liveness.” Information Processing Letters 21 (4): 181–85. https://doi.org/10.1016/0020-0190(85)90056-0.

Consensus algorithms, such as Paxos and Raft, are good examples of how safety and liveness properties are used. A safety condition could be that a minority never changes a value chosen by a majority. A liveness condition could be that consensus makes progress as long as a simple majority of the participating nodes is available.

There is often a tradeoff between safety and liveness properties. Another reason to analyze the state of the system through two separate lenses, safety and liveness, is that for many use cases, it is acceptable to sacrifice liveness (at least for a while) as long as we never compromise on safety. In the context of consensus algorithms, the Paxos and Raft systems are always consistent (a safety condition) but might not make progress (a liveness condition) when a majority is unavailable.

Computational models#

It turns out that the execution environment in which a distributed system operates has a major impact on what can (and cannot) be done. Researchers have devised two kinds of computational models—synchronous and asynchronous.

A synchronous model is one where:

  • Nodes take a known bounded time to process any requests.

  • Network messages take a known bounded time to reach their recipients.

  • The clocks on different nodes have a known bound on the skew that is never violated.

This is a good model where everyone behaves nicely.

An asynchronous model is one where:

  • Nodes can take arbitrarily long to process requests.

  • Messages can take arbitrarily long to be delivered over the network.

  • The clock skew is also not bounded.

This is a harsh model of a computational environment with multiple sources of uncertainty and incomplete information.

The question is how closely our real computational environment matches these theoretical models. The reality is neither as pretty as in the synchronous model nor probably as bad as in the asynchronous model. In some sense, synchronous and asynchronous models are two ends of a spectrum. They help us prove exciting things such as:

  • X is impossible in a synchronous model.

  • Y can be done in an asynchronous model. 

Implicit here is the notion that if X is impossible in a nice synchronous model, it is definitely impossible for anything farther along the spectrum, and if something is possible with a tough model, it is definitely possible with a model with less uncertainty.

Fault models#

Once again, we define a spectrum—this time of failures. The easiest to manage are called crash-stop failures, while the hardest to manage are called Byzantine failures. They serve a similar purpose as in our discussion of computational models: to prove properties.

Crash-stop failures are those where, when a fault occurs, the affected node stops working and does not participate in any further processing. Byzantine failures are arbitrary failures, where participants can behave in an unexpected (possibly malicious) manner.

Spectrum of failure models: from easier to harder

Note: We recently added a new section on consensus protocols in our course on Grokking the Principles and Practices of Advanced System Design. You may read that section to see safety, liveness, and computational and fault models in action.

Dealing with network challenges#

TCP can hide message losses, message reordering, and message duplications. However, hiding message latency is not always possible. System designers go to great lengths to reduce latency for their customers. For example, Netflix puts its hardware inside ISPs (see “Netflix | Open Connect,” https://openconnect.netflix.com/en_gb/) to give its customers excellent quality of service with low latency.

Because a misbehaving network is at the root of many challenges, hyperscalers such as Google have dedicated substantial effort to reducing such issues and to continuously monitoring the network (for further details, see https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf, https://conferences.sigcomm.org/imc/2013/papers/imc253-calderA.pdf, and https://www.usenix.org/system/files/nsdi20-paper-burnett.pdf).

Dealing with node failures#

Modern services rely on extensive monitoring to deal with failed nodes. Still, it is possible that monitoring declares a node failed when it is not. To deal with such issues, systems are built in such a way that safety conditions are always met (for example, in the context of consensus, there is never more than one leader). Similarly, at-least-once semantics (an operation may be performed one or more times) combined with idempotency (executing an operation multiple times produces the same result) come to our rescue when the same message or request is delivered and processed repeatedly, preserving correctness.
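
To make the combination concrete, here is a minimal sketch in Java of idempotent request handling under at-least-once delivery. The names (RequestHandler, applyOperation) and the use of a client-supplied request ID are illustrative assumptions, not a prescribed API.

// Clients may retry after a lost acknowledgment, so the same request can
// arrive more than once (at-least-once delivery). The handler remembers which
// request IDs it has already applied and returns the stored result for
// duplicates, making the operation idempotent from the client's perspective.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RequestHandler {
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    String handle(String requestId, String operation) {
        // computeIfAbsent runs applyOperation at most once per request ID,
        // even if duplicates of the request race with each other.
        return completed.computeIfAbsent(requestId, id -> applyOperation(operation));
    }

    private String applyOperation(String operation) {
        // The actual side effect (e.g., debiting an account) goes here,
        // and thus happens only once per request ID.
        return "result-of-" + operation;
    }
}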

Logical clocks and clocks with confidence intervals#

We saw that synchronizing physical clocks on different nodes using network protocols like NTP is challenging. Leslie Lamport invented logical clocks that can be implemented using simple counters and suggested tagging each network message with these counters. A desirable property of such clocks is that they should support causality. For example, if Event A happens before Event B at a single node, the local logical clock value should be assigned such that the clock value of Event B is larger than the clock value of Event A. When a node sends a message to another node, the clock value at reception should be higher than the one when the message was sent.
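
As a minimal sketch, a Lamport clock can be a single counter per node: increment it on every local event and send, and on receipt fast-forward past the sender's timestamp. The Java class below is illustrative (the names are ours), but it captures the causality rule described above.

// A minimal Lamport logical clock. Each node keeps one instance and tags
// every outgoing message with the value returned by tick().
import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // Local event or message send: advance the clock and return its value.
    long tick() {
        return counter.incrementAndGet();
    }

    // Message receipt: jump past the sender's timestamp if needed, so the
    // receive event is ordered after the corresponding send event.
    long onReceive(long senderTimestamp) {
        return counter.updateAndGet(current -> Math.max(current, senderTimestamp) + 1);
    }
}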

Note: See our chapter on sequencers to know how to generate monotonically increasing, causality bearing counters.

Let's see how we can solve the LWW issue caused by unsynchronized clock values. The following illustration shows that sending a monotonically increasing fencing token with each request helps us avoid the pitfalls we saw earlier. When a node receives a request carrying an old token, it rejects that request.

Use of fencing token to avoid the use of clock value as in last-write-wins strategy
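
A minimal sketch of the storage-side check follows, assuming the lease or lock service hands out strictly increasing tokens; the names FencedStore and StaleTokenException are hypothetical. The same check also guards against the paused lease holder we met earlier: a node that wakes up with an expired lease carries an outdated token, and its writes are refused.

// The store remembers the highest fencing token it has accepted so far and
// rejects writes that carry an older token, so a stale writer cannot
// overwrite data written under a newer token.
import java.util.HashMap;
import java.util.Map;

public class FencedStore {
    static class StaleTokenException extends RuntimeException {
        StaleTokenException(String message) { super(message); }
    }

    private long highestTokenSeen = -1;
    private final Map<String, String> data = new HashMap<>();

    synchronized void write(long fencingToken, String key, String value) {
        if (fencingToken < highestTokenSeen) {
            // The writer holds an outdated lease (e.g., it was paused);
            // refuse the write instead of silently losing newer data.
            throw new StaleTokenException("stale fencing token: " + fencingToken);
        }
        highestTokenSeen = fencingToken;
        data.put(key, value);
    }

    synchronized String read(String key) {
        return data.get(key);
    }
}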

The next innovation came with Google's Spanner, which introduced the TrueTime API. The key observation was this: while it is true that clocks on the nodes of a distributed system cannot be perfectly synchronized, why not explicitly expose the uncertainty in the clock value to applications while trying hard to keep that uncertainty low?

Spanner uses atomic and GPS clocks to keep the clock uncertainty around 6 ms. When the TrueTime API returns a clock value, it also reports the current uncertainty bound.

Time with controlled uncertainty might not seem like much on the surface, but it is a significant feat. Spanner's ability to be a strongly consistent database at a global scale owes quite a bit to the TrueTime API.

The following illustration shows that a timestamp s returned by the TrueTime API has an error bound of ε on either side of s. That effectively means the current time is anywhere between the earliest and the latest time.

Epsilon is the uncertainty in the clock time. Once twice epsilon has elapsed, an application can use the relevant timestamp with certainty, for example, for a data snapshot

The following illustration shows that the uncertainty in the clock increases over time, but Spanner strives to keep it below 6 ms on average. The slanted lines show a gradually increasing skew in the clock value. Occasionally, the uncertainty can exceed this average, and Spanner and its applications have special protocols to deal with such situations (for example, waiting out the uncertainty when they want to be sure that a time point has definitely passed).

Before the client asks for TrueTime (slide 1 of 7)
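
Waiting out the uncertainty is the essence of Spanner-style commit wait. The sketch below uses a hypothetical TimeWithBounds.now() as a stand-in for a TrueTime-like API that returns an interval [earliest, latest] guaranteed to contain the true time; the fixed 4 ms uncertainty is an assumed constant for illustration, not Spanner's actual behavior.

// Commit wait, sketched: choose a commit timestamp at the latest possible
// current time, then wait until the earliest possible current time has passed
// it. After that, every node's clock has surely moved past the timestamp.
public class CommitWait {
    static class TimeWithBounds {
        final long earliestMillis;
        final long latestMillis;

        TimeWithBounds(long earliestMillis, long latestMillis) {
            this.earliestMillis = earliestMillis;
            this.latestMillis = latestMillis;
        }

        // Hypothetical stand-in for a TrueTime-like "now with uncertainty".
        static TimeWithBounds now() {
            long wall = System.currentTimeMillis();
            long epsilonMillis = 4;  // assumed uncertainty bound
            return new TimeWithBounds(wall - epsilonMillis, wall + epsilonMillis);
        }
    }

    // Returns a timestamp that is safely in the past once this method returns;
    // the wait costs roughly two times epsilon.
    static long commitTimestamp() throws InterruptedException {
        long ts = TimeWithBounds.now().latestMillis;
        while (TimeWithBounds.now().earliestMillis <= ts) {
            Thread.sleep(1);
        }
        return ts;
    }
}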

Conclusion#

Writing software for a distributed system is fundamentally different from writing software for a single computer. A distributed system can fail in a myriad of new, unexpected ways. A service built on a distributed system aims to shield its customers from all possible failures and yet serve its purpose well. In this blog, we surveyed some of the primary challenges in building distributed services and how we can manage those challenges. However, we were only able to scratch the surface, and our purpose was to stimulate your interest. To go deeper into these topics, we recommend our following courses: Distributed Systems for Practitioners and Grokking the Principles and Practices of Advanced System Design.


  
