
What are distributed systems? A guide for beginners

Aadil Zia Khan
Mar 21, 2024
16 min read

What do Google’s web search, SETI’s (Search for Extraterrestrial Intelligence) search for alien life, and the Event Horizon Telescope’s (EHT) black hole photography have in common? They all involve massive amounts of data that must be processed, stored, and searched for information and patterns. To do this efficiently, all of these tasks rely on a distributed system. In this blog, we’ll see what a distributed system is and how distributed computing differs from parallel computing. We’ll look at various popular applications that benefit from a distributed design. We’ll also discuss the benefits of distributed computing and the various challenges that arise when implementing such systems. Let’s go!

What is a distributed system?

By the textbook definition, a distributed system is simply a collection of independent computers that work together toward a common goal; the computation they carry out is called distributed computing. To an outsider, the distributed system appears as a single entity. For example, when we enter some search words on Google, we get a response (almost) instantaneously. We simply accessed a web page that, to us, seemed like a single server serving our request. In reality, our request went to several (100+) servers that collaborated to serve us. When you realize that Google indexes and searches billions of web pages to return millions of results in less than a second, you’ll begin to wonder whether a single machine doing all of that is adequate. A better approach is to split the task into multiple subtasks that are then assigned to separate machines working independently of each other.

When a search request comes in, it’s forwarded to multiple servers. Each server returns a subset of the results. These responses are then aggregated and returned to the user. This approach is much faster than carrying out the search on one machine. Google Search is a good example of a distributed system: multiple nodes work together to achieve a performance that wouldn’t otherwise be possible.

Task distribution
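As a rough sketch of this fan-out-and-merge flow, here is a toy Python example. The shard names, the in-memory index, and the `search_shard` function are all hypothetical; a real search cluster would send the query to remote machines over the network rather than to local threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each "server" holds its own slice of the search index.
SHARDS = {
    "server-1": {"distributed systems": ["doc1", "doc4"]},
    "server-2": {"distributed systems": ["doc7"]},
    "server-3": {"distributed systems": ["doc2", "doc9"]},
}

def search_shard(shard_name, query):
    """Simulate one server searching only its own slice of the index."""
    return SHARDS[shard_name].get(query, [])

def distributed_search(query):
    # Scatter: send the query to every shard in parallel.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, query), SHARDS)
        # Gather: merge the partial result lists into a single response.
        return [doc for result in partial_results for doc in result]

print(distributed_search("distributed systems"))
# ['doc1', 'doc4', 'doc7', 'doc2', 'doc9']
```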

You might be thinking, why do we need multiple machines to collaborate? Why can’t all this be done by a single powerful machine? For that, we need to understand scaling up and parallel computing.

Scaling up and the limitations of parallel computing

Unlike distributed computing, parallel computing splits a task into subtasks and assigns each one to a different CPU (or to different cores of a single CPU) on the same machine rather than to separate machines. For small tasks, this is doable. But what if the task is large? We would need to scale up.

Scaling up, also called vertical scaling, consists of adding more resources to the same machine. For example, we could increase the number (or processing power) of CPUs and/or cores, use a bigger RAM, use storage media that is faster and has more capacity, and increase network bandwidth. There are three issues with this approach that make adoption difficult. First, the state of the art imposes limits beyond which we can’t power up the machine; for instance, increasing a CPU’s processing power means adding more circuitry on the chip, which produces heat that could overheat, or even damage, the system. Second, unlike distributed systems, where each machine has its own resources, having multiple processors compete for shared system resources creates a bottleneck at the resource access point. Third, cost doesn’t necessarily increase linearly when we scale up; for example, moving from a CPU with 16 cores to one with 32 cores can more than double the price.

Scale up, or vertical scaling

If we want to play a computer game that our hardware doesn’t support, we can simply add a more powerful graphics card and the required RAM. But what if we want to search exabytes of data in milliseconds? Can we add more RAM, more disk storage, and a CPU having 100+ cores? No. This is where distributed systems help us by scaling out instead of scaling up.

Scaling out via distribution

The web search example covered previously is a perfect example of scaling out, also called horizontal scaling. Here, as the task’s requirements increase, instead of making the machines more powerful, we simply add more machines. These are independent nodes with their own resources.

For instance, suppose a web server can serve, on average, 1,000 requests per second. Suppose the website gets popular, and the traffic increases to 2,000 requests per second. We can easily scale up and add more RAM, faster storage media, and higher bandwidth links to cater to these requests. But what if the traffic sees a massive increase, and now the website needs to serve one million requests per second? It might not be possible (or feasible) to enhance the computational capabilities of the computer system to such an extent. In this scenario, we would need to scale out. We would need to add multiple servers, each serving a subset of the incoming requests. As the traffic increases, we would incrementally add more servers to this distributed system and split the incoming requests between them.

Scale out, or horizontal scaling
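To make the request-splitting idea concrete, here is a minimal round-robin sketch in Python. The server addresses are hypothetical; in practice this job is done by a dedicated load balancer such as NGINX or HAProxy rather than by application code.

```python
import itertools

# Hypothetical pool of backend servers; in practice these would be separate
# machines sitting behind a dedicated load balancer (e.g., NGINX or HAProxy).
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

class RoundRobinBalancer:
    """Hand each incoming request to the next server in the pool, in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request_id):
        server = next(self._cycle)
        return f"request {request_id} -> {server}"

balancer = RoundRobinBalancer(SERVERS)
for request_id in range(6):
    print(balancer.route(request_id))

# Scaling out means appending more addresses to SERVERS; no individual
# machine has to get any more powerful to absorb the extra traffic.
```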

Up till now, we have only looked at the need for distributed systems from the perspective of scalability. Distributed systems are much more than just that. We’ll now look at the benefits of task distribution and the challenges associated with them.

Let’s dig in.

Why go for distributed computing?

We have already seen one major benefit of distributed systems: improving performance and achieving scalability. Sharing of resources is another important benefit. If a machine is not being used, it’s better to let other users and applications utilize its resources rather than let them go to waste. One of the most critical benefits is resilience to failure. If one machine fails, replicas of the content or service on other machines ensure that data is not lost and that the service remains available.
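To illustrate that last point, here is a minimal sketch, in Python, of a client failing over to a replica. The replica names and the `fetch` function are hypothetical stand-ins for real network reads; an actual system would also handle retries, timeouts, and stale data.

```python
# Hypothetical replica set: the same record is mirrored on three machines.
REPLICAS = ["replica-a", "replica-b", "replica-c"]

def fetch(replica, key):
    """Stand-in for a network read; pretend replica-a is currently down."""
    if replica == "replica-a":
        raise ConnectionError(f"{replica} is unreachable")
    return f"value-of-{key}@{replica}"

def resilient_read(key):
    # Try the replicas in order; as long as one copy is alive, the read succeeds.
    for replica in REPLICAS:
        try:
            return fetch(replica, key)
        except ConnectionError:
            continue  # move on to the next replica
    raise RuntimeError("all replicas are down")

print(resilient_read("user:42"))  # served by replica-b despite the failure
```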

There are several applications that rely on distributed systems. Let’s get a feel for the different types of distributed applications.

Searches and scientific computing: These involve searching huge amounts of data, which would otherwise take several hours if done on a single machine. Search engines that crawl the internet and then allow searches over petabytes of indexed data are one example. Other search applications include searching scientific data for information, compiling insights, and identifying recurring patterns. For example, the SETI@home program (https://setiathome.berkeley.edu/) distributes radio telescope data over participating machines, and each machine searches for signs of intelligent life in the data assigned to it. This is an amazing example of volunteer computing, where volunteers contribute their computing resources to solve computationally intensive problems. Folding@home (https://foldingathome.org/) is another such effort. In this project, protein folding simulations are assigned to participating machines, and each returns its results to the project database. The results have enabled advances in biomedical research.

Training models: Training machine learning and deep learning models involves huge datasets. Processing these on a single machine would be time-consuming; distributing the processing over multiple machines saves time. More recently, large language models have appeared that are trained on enormous text corpora, typically measured in petabytes. The data comprises books, web articles, text from websites and forums, etc., in many different languages. Maintaining and processing all this data would not be possible on a single machine. To understand the scale involved in a real-world training project: according to a CNBC article (Vanian, Jonathan, and Kif Leswing. “ChatGPT and Generative AI Are Booming, but at a Very Expensive Price.” CNBC, March 13, 2023. https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html), Meta’s first LLaMA model was trained on 1.4 trillion tokens using more than 2,000 NVIDIA A100 GPUs, and training still took around 21 days, totaling around one million GPU hours.

Cloud computing: This was one of the biggest applications of the last decade and will continue to gain importance. Major players, such as Amazon Web Services, Microsoft Azure, and Google Cloud, built large data centers (groups of typically several hundred networked servers used for remote storage and processing of large amounts of data). These companies then let users consume resources on these machines per their needs. Billing is based on the amount of resources used and the duration for which they are used. These resources include storage, processing cycles, and bandwidth. If a user isn’t using a resource, they aren’t charged for it; instead, the resource is allocated to another user. Further value-added services, such as resource usage reporting, insights for optimizing costs, ransomware protection, and dedicated technical advisory services, can also be provided on top of these basic services. This approach brings efficiency to resource utilization, and people whose requirements vary over time have the flexibility to add or remove resources as needed and pay accordingly.

Web services: Various web services that handle thousands of requests per second also rely on distribution to achieve massive scale. Weather APIs, news feeds, etc., that are used by individual users and other websites benefit from a distributed architecture. By distributing traffic over different machines, they are able to scale. If the machines are spread over different geographical areas, the incoming requests can be routed toward the nearest server, thereby reducing delays and improving performance. The system also becomes more fault tolerant because if a single server fails, the remaining servers can take over.

Content distribution networks (CDNs): Before cloud computing took off, we had CDNs like Akamai. Various websites and multimedia streaming services benefit from them. CDNs help improve performance by mirroring content on servers that are placed all over the world, and end user requests are directed toward the nearest server. This reduces the load on a single machine and also helps reduce delays by serving the request from the nearest location. Also, even if a server fails, the user would be able to get content from an alternate server nearest to the user’s location.

Data storage and sharing: Various distributed system applications are targeted toward file sharing. For instance, torrent clients, such as BitTorrent, create a peer-to-peer network between the active users, and each user simultaneously downloads small chunks of the file from the multiple peers that can transmit them the fastest. A distributed approach ensures that files remain available even if some of the peers are offline, and by downloading in parallel, the system utilizes the aggregate bandwidth optimally. A minimal sketch of this chunked, parallel download appears at the end of this list of applications.

Critical operations: Banking, for instance, is a service that needs very high fault tolerance. Even if a server goes offline, a bank can neither afford to lose customer data nor let the service become unavailable to its customers; the consequences would be catastrophic. Distributing and mirroring the service over multiple machines ensures service availability and the security of the data.

Decentralization: In recent years, efforts to decentralize and democratize the internet have picked up pace, and various applications have been created in this regard. Blockchains, such as Bitcoin and Ethereum, and privacy tools, such as the Tor network and its browser, are built as distributed systems and aim to ensure privacy and give users independence by reducing centralized control.

Massively multiplayer online games: Gaming needs no further explanation. Games are fun. We want to play with friends all over the world and enjoy without interruptions, something that a distributed architecture ensures.
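The file-sharing example above lends itself to a small code sketch. The following Python snippet is a toy illustration, assuming a hypothetical set of peers and in-memory chunks; a real client such as BitTorrent also verifies chunk hashes and prefers the fastest peers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical swarm: each peer holds some chunks of the same file.
PEERS = {
    "peer-1": {0: b"dis", 2: b"ted"},
    "peer-2": {1: b"tribu", 3: b" sys"},
    "peer-3": {0: b"dis", 3: b" sys", 4: b"tems"},
}

def download_chunk(chunk_id):
    """Ask the first peer that holds this chunk for its bytes."""
    for peer, chunks in PEERS.items():
        if chunk_id in chunks:
            return chunk_id, chunks[chunk_id]
    raise LookupError(f"no peer has chunk {chunk_id}")

def download_file(num_chunks):
    # Fetch all chunks in parallel from whichever peers hold them,
    # then reassemble them in the right order.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        pieces = dict(pool.map(download_chunk, range(num_chunks)))
    return b"".join(pieces[i] for i in range(num_chunks))

print(download_file(5))  # b'distributed systems'
```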

Distributed computing—not so simple

Distributed computing makes system design way more complex. There are assumptions and simplifications that hold true in a single-machine system but don’t work in a distributed environment. Let’s look at some challenges now.

Coordination: Coordination is a key challenge. Multiple computers are collaborating toward a common goal, so who decides which task is assigned to which computer? Assigning tasks in a manner that optimally utilizes overall system resources is nontrivial. Adding to the complexity, the nodes have varying capacities, and an assignment that is optimal in terms of resource utilization might not be optimal for the end user. Users want quick response times and might need to access the machine closest to them to reduce latency, which is not necessarily the most efficient allocation of resources. If one of the machines is to be assigned a leadership role and made responsible for this coordination, we need consensus algorithms that the participating machines use to elect a leader. This process incurs delays and might become a performance bottleneck if there is high churn (nodes rapidly joining and leaving the system without giving it a chance to reach a stable state; this is common, for example, when users join a live video stream and leave or rejoin within the first few minutes).
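As a rough illustration of leader election, here is a minimal, bully-style sketch in Python. The cluster membership table and the liveness flags are hypothetical; real systems use consensus protocols such as Raft or Paxos, which also cope with message loss and network partitions.

```python
# Hypothetical cluster membership table: node ID -> is the node currently alive?
CLUSTER = {1: True, 2: True, 3: False, 4: True, 5: False}

def elect_leader(cluster):
    """Bully-style rule: the highest-ID node that is still alive becomes leader."""
    alive = [node_id for node_id, is_up in cluster.items() if is_up]
    if not alive:
        raise RuntimeError("no live nodes available to coordinate the system")
    return max(alive)

print(elect_leader(CLUSTER))  # 4, since node 5 (the highest ID) is down

# Under high churn, membership keeps changing, elections repeat, and this
# coordination step can itself become a performance bottleneck.
```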

Connectivity: A related challenge is maintaining a communication network between the nodes. There are two extremes in how to do this. On one extreme, we have the client-server architecture, where all nodes communicate only with a centralized server. The server has the complete picture of the network and can make optimal decisions. However, the server becomes a bottleneck and a single point of failure, especially when the number of clients is large.

On the other extreme, we have the peer-to-peer architecture, where there is no centralized server, and each node communicates with all other nodes (or a subset of them) to make decisions and convey them to others. Here, the control traffic might become a bottleneck if the number of peers gets huge, but no single machine is overloaded, as is the case in the client-server approach. Searching for the machine that has the required data, or that can work on the required task, is harder here than in the client-server approach, where the server is responsible for answering such requests. So there is always a trade-off that the system designer needs to be aware of; sometimes, a hybrid approach is used to bring in the best of both worlds.

Consistency: Maintaining a consistent state all across the system is another key challenge. For fault tolerance, distributed systems usually mirror processes and tasks across multiple servers that might be spread over a wide geographical area. This mirroring is not as easy as it sounds. For example, if one machine crashes, it’s easy to identify that because it would stop responding to messages. But what if the machine is working, but the data is corrupted? This is called a Byzantine fault, and a majority vote is needed between multiple machines to decide which version of the data is the corrupted one. This, of course, incurs an overhead. If multiple instances get corrupted simultaneously, we would need to include more nodes in the voting to determine the correct value.
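The majority-vote idea can be sketched in a few lines of Python. This is only a toy illustration of voting over replica reads; real Byzantine fault tolerance protocols (such as PBFT) involve multiple rounds of message exchange and authenticated messages.

```python
from collections import Counter

def majority_value(replica_reads):
    """Return the value most replicas agree on, or None if there is no majority."""
    value, votes = Counter(replica_reads).most_common(1)[0]
    return value if votes > len(replica_reads) // 2 else None

# Hypothetical reads of the same record from five replicas; one copy is corrupted.
reads = ["balance=100", "balance=100", "balance=999", "balance=100", "balance=100"]
print(majority_value(reads))  # 'balance=100': the corrupted copy is outvoted
```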

Updates to data also become complex when data is mirrored across multiple machines. Due to network limitations, updates can never be instantaneous. A problem arises if one user accesses data on a machine that has not yet received an update while another user accesses a machine that has been updated: both see different data, and the system state becomes inconsistent. Consistency models and algorithms are therefore an important area of system design. Note also that because of communication delays, maintaining a universal time across all the machines becomes difficult. There are algorithms that help achieve this, but if the clocks go out of sync, many applications that rely on timestamps will fail. For example, if multiple users are doing financial transactions, we need to know the order of those transactions.
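One classic tool for ordering events without synchronized physical clocks is the Lamport logical clock. Below is a minimal Python sketch; the two "bank" objects are hypothetical and only illustrate how a receiver's clock jumps past the sender's timestamp so that causally related transactions keep their order.

```python
class LamportClock:
    """A logical clock: orders events without relying on synchronized wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event (e.g., recording a transaction) advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, message_time):
        # On receipt, jump ahead of the sender's timestamp if necessary.
        self.time = max(self.time, message_time) + 1
        return self.time

bank_a, bank_b = LamportClock(), LamportClock()
t1 = bank_a.send()        # bank A records a transaction and sends it
t2 = bank_b.receive(t1)   # bank B's clock moves past A's, preserving the order
print(t1, t2)             # 1 2
```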

Security: No discussion on challenges is complete without mention of security issues. Security is especially important in distributed systems because in many cases, the participants might be independent users not known to each other. For example, in torrents, the data might be fetched from unknown people who might be malicious users. Similarly, in cloud computing, a machine might be shared between multiple users, and separating and securing each user’s data would be critical.

When studying these design challenges, keep in mind that on one hand, the goal is to have a system that consists of nodes collaborating toward a common goal, and on the other, we need to ensure they appear to be a single entity. The problems discussed above make it difficult to achieve the latter. The challenges seem daunting. But the good thing is that there are tools that do most of the work for us. This brings us to the last point: middleware for building distributed applications.

Middleware: Don’t do unnecessary work

Middleware technologies help save development time and simplify system management by abstracting various aspects of interaction between different components. Standardized interfaces further simplify complexities associated with distributed computing.

Message-oriented middleware (MOM) and message queue middleware, including Apache Kafka and Apache ActiveMQ, allow asynchronous communication (communicating nodes don’t have to be online at the same time and can receive messages once they come back online) through message exchanges. They enable communication patterns such as publish-subscribe messaging, where some components produce data and the middleware forwards it as messages to the components that have registered their interest in it. Remote procedure call (RPC) middleware, such as Java RMI and gRPC, facilitates seamless execution of code on remote systems.
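To show the publish-subscribe pattern itself (rather than any particular product’s API), here is a minimal in-memory sketch in Python. A real broker such as Apache Kafka or ActiveMQ additionally persists messages and delivers them asynchronously, so subscribers that are offline can catch up later.

```python
from collections import defaultdict

class MiniBroker:
    """Toy message broker: producers publish to topics, subscribers get callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Forward the message to every component that registered interest.
        for callback in self._subscribers[topic]:
            callback(message)

broker = MiniBroker()
broker.subscribe("orders", lambda msg: print("billing saw:", msg))
broker.subscribe("orders", lambda msg: print("shipping saw:", msg))
broker.publish("orders", {"id": 17, "item": "keyboard"})
```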

Database middleware, including ODBC and JDBC, provides standardized interfaces for efficient database access. Together with transactional middleware like the Java Transaction API, it ensures the ACID properties (atomicity, consistency, isolation, and durability; these properties ensure that database operations grouped together as a transaction keep the data consistent in the face of unexpected errors) of distributed transactions. Web services middleware like Spring Web Services and Apache CXF facilitates web-based communication using protocols like SOAP and REST. Dataflow middleware, such as Apache Hadoop, enables big data processing through its MapReduce programming model (an aggregator machine splits data between multiple worker machines, which process it and return the results to the aggregator).
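The MapReduce model can be illustrated with a toy word count. The sketch below uses Python’s multiprocessing pool as a stand-in for a cluster of worker machines; in Hadoop, the same map and reduce phases run on many nodes over data stored in a distributed file system, and the example data split here is purely hypothetical.

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(chunk):
    """Each worker counts the words in its own chunk of the data."""
    return Counter(chunk.split())

def reduce_phase(partial_counts):
    """The aggregator merges the per-worker counts into one result."""
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total

if __name__ == "__main__":
    # Hypothetical data split: each string stands in for a block on a different node.
    chunks = ["to be or not to be", "be quick or be slow", "not so quick"]
    with Pool(processes=3) as pool:
        partial = pool.map(map_phase, chunks)
    print(reduce_phase(partial))
```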

Each middleware type addresses specific aspects of distributed system challenges. There is a vibrant ecosystem of middleware out there. Looking at the popular ones in each of the middleware types mentioned above would be a good starting point in developing distributed systems.

This blog introduced you to the benefits of a distributed system design and the popular applications in the market. Next time you play a multiplayer online game, use a banking app, or look at the next image of a black hole, you’ll know how distributed systems helped make those possible and the challenges that were addressed by the developers.

To learn more about distributed systems and the middleware ecosystems, check out our other courses.

