Consensus algorithms typically handle node crashes, network partitions, and message delays. Some algorithms are more tolerant of certain types of failures than others.
Consensus in distributed systems refers to the process of achieving agreement among multiple nodes or processes on a single value or operation. Multiple algorithms exist to achieve consensus under different scenarios, such as leader election among a cluster of nodes, achieving consistency in data replication across multiple databases, acquiring distributed locking, and so on. These algorithms provide the necessary framework to coordinate distributed systems (nodes) and enable them to work together even in the event of failures or cyberattacks.
Consensus algorithms are vital for System Design, particularly when addressing Byzantine failures, in which nodes exhibit arbitrary or malicious behavior. In fact, nearly every large-scale distributed system relies on consensus in its day-to-day operation.
In this blog, I’ll present the theoretical foundations and practical algorithms to achieve consensus and their importance in the System Design domain.
Key takeaways:
In this blog, you will understand the following:
The essential traits of consensus algorithms (protocols).
How the following four main consensus algorithms work, and their limitations:
The two-phase commit (2PC) protocol
The three-phase commit (3PC) protocol
The Paxos algorithm
The Raft algorithm
The consensus algorithms used in the designs of systems such as YouTube, Instagram, Uber, Facebook, and so on.
In the end, you will see some real-world applications of consensus algorithms and practical considerations to choose a consensus algorithm.
Consensus algorithms enable nodes to achieve agreement in different and unpredictable environments, making them crucial for today’s computing infrastructure. The following are some of the key properties of consensus algorithms that make them the unsung heroes of distributed computing:
Safety (data consistency and integrity): To ensure data consistency and integrity, the system should never produce an incorrect result.
Liveness (agreement): These algorithms ensure that each correct node decides on a value to guarantee progress.
Fault tolerance: The system should operate correctly during the failure of some nodes without compromising safety and liveness.
Immutability: The consensus decision should be irreversible to ensure the system’s stability and predictability.
Termination: The consensus decision should be finalized in a bounded number of steps, and the system should eventually reach a decision.
Decentralization: The consensus algorithms should minimize dependency on a central authority or node.
Let’s talk about some of the following important consensus algorithms/protocols:
Two-phase commit (2PC)
Three-phase commit (3PC)
Paxos consensus algorithm
Raft consensus algorithm
Introduced in 1978, the 2PC algorithm was developed in the context of database transactions. Relational databases provide the ACID properties, where A stands for atomicity. It means that a transaction either commits in total or aborts; in other words, it happens completely or it doesn’t occur at all. When we want different systems to either commit or abort together, we use the 2PC algorithm.
A common use case for 2PC is when a transaction reads and writes multiple database shards, and an application wants to ensure that all such changes happen as one unit. In this case, consensus means that every participant should be on the same page. The 2PC protocol can be used in any scenario that requires atomicity across all participants.
The 2PC protocol relies on a leader or coordinator node and a set of participant nodes called cohorts. The coordinator manages the state, gathers votes, and serves as the main reference point for the agreement process. The cohorts operate on the replicas against which transactions are performed. Both the coordinator and the cohorts maintain local logs for every step that is executed.
There are two phases in the 2PC protocol: the prepare phase and the commit phase, as shown in the following illustration:
Primarily, the 2PC protocol has the following drawbacks:
Blocking issue: If the coordinator fails during the commit phase, it can block the entire working of the 2PC protocol. In such a case, cohorts may be left in an uncertain state, waiting indefinitely for the coordinator to recover and decide whether to commit or abort.
Network partition: If a network partition occurs after the coordinator has received the cohorts’ votes but before it sends its decision (commit/abort), the cohorts can be left waiting indefinitely.
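To make the two phases concrete, here is a minimal, single-process sketch of the coordinator and cohorts (all class and method names are illustrative, not from any particular library). The comment in the middle marks the exact point where a coordinator crash leaves cohorts blocked, which is the weakness described above.

```python
# Minimal two-phase commit sketch: one coordinator, in-memory cohorts.
# All names (Cohort, Coordinator, run_2pc) are illustrative.

class Cohort:
    def __init__(self, name):
        self.name = name
        self.state = "INIT"      # INIT -> PREPARED -> COMMITTED/ABORTED

    def prepare(self, txn):
        # Phase 1: vote YES only if the transaction can be made durable locally.
        self.state = "PREPARED"  # a real cohort writes a log record before voting
        return True              # vote YES (a real cohort may also vote NO)

    def decide(self, commit):
        # Phase 2: apply the coordinator's decision.
        self.state = "COMMITTED" if commit else "ABORTED"


class Coordinator:
    def __init__(self, cohorts):
        self.cohorts = cohorts

    def run_2pc(self, txn):
        # Phase 1 (prepare): collect votes from every cohort.
        votes = [c.prepare(txn) for c in self.cohorts]
        decision = all(votes)    # commit only if every vote is YES

        # If the coordinator crashed here, cohorts would stay PREPARED,
        # blocked until it recovers -- the classic 2PC weakness.

        # Phase 2 (commit/abort): broadcast the decision.
        for c in self.cohorts:
            c.decide(decision)
        return decision


cohorts = [Cohort("shard-1"), Cohort("shard-2")]
print(Coordinator(cohorts).run_2pc({"debit": 100, "credit": 100}))  # True
print([c.state for c in cohorts])  # ['COMMITTED', 'COMMITTED']
```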
Note: You can have a look at the detailed working of the 2PC protocol to expand your understanding.
The 3PC protocol was introduced in 1983 to address the limitations of the 2PC protocol. It builds on 2PC by adding a third step, the pre-commit phase, between the prepare and commit phases.
3PC operates under the assumption that messages between cohorts and the coordinator are delivered within a known, fixed time. In case of coordinator failure, this additional step allows cohorts to proceed with either commit or abort depending on the system state.
The 3PC protocol coordinates state transitions between the coordinator and cohorts to ensure that all nodes agree on the same decision. Cohorts must wait for all nodes to complete the previous phase before moving on to the next phase and can abort the transaction if they do not hear from the coordinator before the timeout. Unlike 2PC, 3PC can recover from coordinator failures and allows cohorts to proceed with a deterministic decision.
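A minimal sketch of the cohort-side rule that the pre-commit phase and timeout enable is shown below; the state names and the timeout handling are illustrative only.

```python
# Illustrative 3PC cohort state machine:
# INIT -> PREPARED -> PRE_COMMITTED -> COMMITTED (or ABORTED).
# A cohort that times out while only PREPARED aborts; one that has already
# reached PRE_COMMITTED knows everyone voted YES and can safely commit.

def on_timeout(state):
    """Decision a cohort takes when it stops hearing from the coordinator."""
    if state == "PREPARED":
        return "ABORTED"        # no pre-commit seen yet: safe to abort
    if state == "PRE_COMMITTED":
        return "COMMITTED"      # all cohorts voted YES: safe to commit
    return state                # already decided or never started

print(on_timeout("PREPARED"))       # ABORTED
print(on_timeout("PRE_COMMITTED"))  # COMMITTED
```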
The following limitations could be the primary reason why 3PC is not commonly used in practice:
Network partition: The worst-case scenario for 3PC is a network partition in which some nodes successfully reach the pre-commit state and proceed to commit, while others cannot communicate with the coordinator and abort after their timeout expires. This results in a split brain, leaving participants in inconsistent and contradictory states.
Higher message overhead: The extra phase adds another round of messages per transaction, which increases latency and coordination cost, and the protocol can still end up in inconsistent states under certain failures.
Note: You can have a look at the working of the 3PC protocol to understand its detailed working.
Paxos is a consensus algorithm used by multiple entities to agree on something of interest, such as maintaining consistency among multiple copies of data.
To understand Paxos, let’s look at the following analogy:
Let’s assume that a group of people wants to decide on a place to have lunch. There are three different types of people involved in the decision. Some knowledgeable persons suggest good places, some agree or disagree with the suggested places, and some wait for a final decision without providing input. In Paxos, the three types of roles are proposers, acceptors, and learners. Let’s discuss each of these roles in light of the Paxos algorithm, considering that these roles are associated with some nodes in a cluster (or replica group):
Proposers: These nodes send their proposals to other nodes in a replica group. Each proposer aims to have a majority of votes for their proposals.
Acceptors: These nodes either accept or reject the proposed value from the proposers.
Learners: The acceptors inform these nodes of the value chosen by a majority.
There are two steps involved in the Paxos protocol:
The handshake phase (also known as the prepare phase)
The value acceptance phase (also known as the accept phase)
In the first step, the proposer sends a proposal number to the acceptors to gain majority votes without sending them any value. Mind the difference between the proposal number and the proposed value. In the second step, the proposer that received a majority of votes moves ahead to propose the value and sends a request to the majority of acceptors (the ones that voted for it). If the majority of acceptors accept the value, then that value is chosen for all the replica group nodes.
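The following is a minimal, single-process sketch of the acceptor’s rules for these two phases (the `Acceptor` class and its method names are illustrative). It shows only the core promise/accept logic, omitting learners, competing proposers, and failure handling.

```python
# Illustrative single-decree Paxos acceptor.
# prepare(n): promise not to accept proposals numbered below n.
# accept(n, v): accept value v unless a higher-numbered promise was made.

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised
        self.accepted_n = -1       # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise n if it is the highest seen; report any prior accept.
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        # Phase 2: accept only if no higher-numbered prepare was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


acceptors = [Acceptor() for _ in range(3)]
# Phase 1: the proposer sends proposal number 1 and needs a majority of promises.
promises = [a.prepare(1)[0] for a in acceptors]
# Phase 2: with a majority of promises, it asks the acceptors to accept value "X".
if sum(promises) > len(acceptors) // 2:
    accepted = [a.accept(1, "X") for a in acceptors]
    print("chosen" if sum(accepted) > len(acceptors) // 2 else "not chosen")
```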
The following illustration shows the steps involved in the Paxos algorithm:
Knowledge test!
Why are the two phases necessary in the basic Paxos algorithm? Why can’t we have just one phase?
How would a network partition impact the Paxos algorithm in light of the CAP theorem?
The following are some limitations of the Paxos protocol:
Performance inefficiency: Paxos requires several rounds of communication between servers to reach consensus, which increases latency and reduces throughput.
Scalability issue: As the number of servers increases, the communication overhead and coordination complexity also increase, which limits the protocol's scalability.
The Raft protocol is primarily used for leader election and log replication across a cluster of servers. Its working mechanism is based on an election among the servers in the cluster. It selects a leader based on a minimum of ⌊n/2⌋ + 1 votes, i.e., a majority of the n servers. Each server plays one of the following three roles:
Leader: In Raft, a leader handles all the client’s requests and disseminates them to the followers.
Follower: All other servers are considered followers apart from the leader. The followers receive direction from the leader and operate according to those requests.
Candidate: A server may transition from a follower to a candidate. A candidate is a potential leader appointed as a part of the election among the servers in the cluster.
Raft uses remote procedure calls (RPC) to communicate between servers. Primarily, it uses the following two types of messages:
RequestVote: This call is initiated by a candidate during an election to request votes from the other servers.
AppendEntries: The leader uses this call to replicate log entries into the logs of the follower nodes. Empty AppendEntries calls also serve as the leader’s heartbeat, letting followers know the leader is still alive and connected.
In Raft, when its election timeout expires, a follower transitions to the candidate state, votes for itself, and sends RequestVote RPCs to all other servers in the cluster. A candidate wins the election if it receives votes from a majority of the servers. Once the candidate becomes the leader, it sends AppendEntries RPCs to all the followers, which append the entries to their logs and acknowledge them back to the leader.
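Below is a minimal sketch of the election flow just described: a follower whose randomized timeout expires becomes a candidate and requests votes from its peers. All names are illustrative; a real Raft implementation also compares log completeness in RequestVote and uses AppendEntries heartbeats to keep followers from timing out.

```python
import random

# Illustrative Raft-style election round.

class Server:
    def __init__(self, name):
        self.name = name
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        # Randomized timeout reduces the chance of simultaneous candidates.
        self.election_timeout_ms = random.randint(150, 300)

    def request_vote(self, term, candidate):
        # Grant at most one vote per term, and only for a newer term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = candidate
            return True
        return False


def run_election(candidate, peers):
    candidate.state = "candidate"
    candidate.current_term += 1
    candidate.voted_for = candidate.name
    votes = 1  # the candidate votes for itself
    for p in peers:
        if p.request_vote(candidate.current_term, candidate.name):
            votes += 1
    # The candidate needs a majority of the whole cluster to become leader.
    if votes > (len(peers) + 1) // 2:
        candidate.state = "leader"
    return candidate.state


servers = [Server(f"s{i}") for i in range(5)]
print(run_election(servers[0], servers[1:]))  # leader
```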
The following illustration demonstrates how the Raft protocol works.
Note: We need to use election timeouts with randomness added to avoid the situation where multiple followers become candidates simultaneously.
Knowledge test!
How is a leader failure handled in the Raft protocol?
How does a follower catch up with the leader after failure?
How can you relate the CAP theorem to the Raft protocol, and how can the network partition impact its functioning?
The following are some limitations of the Raft protocol:
Single leader bottleneck: For log replication, the Raft protocol depends on a single leader, which can become both a performance bottleneck and a single point of failure until a new leader is elected.
Inefficient log compaction: Log compaction reduces the size of the log by creating a snapshot of the system’s current state. This process can be resource-intensive and can degrade performance during heavy write activity or when logs grow large.
Note: You can check out the following course to understand the above consensus algorithms in detail, along with state machine replication, the Byzantine generals problem, and practical Byzantine fault tolerance (PBFT) algorithms.
This course teaches you how large, real-world systems are built and operated to meet strict service-level agreements. You’ll learn the building blocks of a modern system’s design by picking and combining the right pieces and understanding the trade-offs between them, and you’ll study some great systems from hyperscalers such as Google, Facebook, and Amazon. The course draws on hand-picked, seminal work in system design that has stood the test of time and is grounded in strong principles, and you will see those principles in action in real-world systems. After taking this course, you will be able to solve various system design interview problems, gain a deeper understanding of outages in your favorite apps, and follow their post-mortem reports. It will set a high standard for your own system designs so that you can emulate similar success in your endeavors.
Let’s see how each of the consensus algorithms can be used in some of the design problems. Please note that we have skipped the 3PC algorithm because it is similar to 2PC; the only difference is that it adds a layer of fault tolerance, making it useful for more reliable transaction processing, particularly in environments with higher latency.
| Design problem | Two-phase commit (2PC) | Paxos | Raft |
|---|---|---|---|
| Video streaming (e.g., YouTube) | Ensures consistency when processing video uploads and metadata updates. | Ensures consensus in leader election for managing video distribution and streaming servers. | Ensures leader election and log replication, useful for video distribution and metadata management. |
| Q&A platform | Useful for consistent updates to user-generated content like questions and answers across multiple databases. | Can be used to maintain consistency in distributed databases that store user content. | Useful for managing distributed log replication, ensuring that user content updates are consistently applied. |
| Mapping and navigation service | Ensures that updates to map data or user input are committed consistently across servers. | Helps in achieving consensus on updates to geographical data or route calculations. | Ensures consistency in the replication of mapping data or geographical updates across nodes. |
| Ride-hailing (e.g., Uber) | Can be used to ensure that ride-booking and payment transactions are consistently committed across multiple services. | Ensures that critical data like ride matches are consistent across multiple instances. | Ensures consistency in log replication for ride-booking and match-making services. |
| Photo and story sharing (e.g., Instagram) | Used for consistent updates to user content like posts or stories across distributed services. | Ensures consistency in leader election for services managing posts and stories. | Similar to Paxos, ensures leader election and log replication for user-generated content updates. |
| Messaging service | Ensures that message delivery and status updates are consistently committed across multiple nodes. | Ensures consensus in message delivery sequences across distributed nodes. | Ensures consistent message sequence and delivery through log replication. |
| Real-time collaborative document editing | Ensures that document changes are consistently committed across multiple replicas in real-time collaboration. | Ensures consensus in the sequence of document edits across distributed users, maintaining consistency in real-time collaboration. | Ensures consistent document changes through leader election and log replication, maintaining smooth real-time collaboration. |
Let’s look at some of the real-world applications and use cases of consensus algorithms in the System Design domain:
Leader election: In System Design, we divide data into different shards and partitions stored on different servers. We also have redundant database servers to increase the system’s availability. In such a configuration, there needs to be a server (primary or leader) that will be responsible for all data read and write requests and communicate with the other servers in the cluster. In such a leader-follower (or primary-secondary) paradigm, consensus algorithms, specifically the Raft algorithm, are used to elect a leader. This process is often known as quorum-based consensus or leader election.
Consistent data replication: In redundant databases, consistent data replication across multiple replicas of DBs or filesystems is a challenging task. Consensus algorithms such as Raft and Paxos are used to ensure consistent data replication or updates across multiple replicas.
Distributed locking: Distributed locking is needed to manage access to shared resources in a distributed system, ensuring that only one node at a time can hold the lock and perform operations on the resource, thus preventing conflicts and ensuring consistency. This is crucial for coordinating tasks such as database updates, file access, and resource allocation in environments where multiple nodes might compete for the same resource. Consensus algorithms like Raft and Paxos are commonly used to implement distributed locking, as they provide mechanisms for agreeing on which node should hold the lock, ensuring that the locking mechanism remains consistent and reliable even in the face of node failures or network partitions. (A minimal sketch of this idea appears after this list.)
Distributed transactions and distributed consensus: Consensus algorithms are used in distributed transactions and distributed consensus to ensure that all participating nodes agree on the sequence and outcome of transaction operations, providing consistency and fault tolerance. By achieving consensus on commit decisions and coordinating resource locks, these algorithms prevent conflicts and ensure that transactions are executed atomically and consistently across the distributed system.
Atomic broadcast: Consensus algorithms are used in atomic broadcasts to ensure that messages are delivered consistently to all nodes in distributed computing, even in the presence of failures. By achieving agreement on the order and content of messages, consensus algorithms like Raft or Paxos ensure that all participating nodes receive the same sequence of messages, thus maintaining a uniform state across the system and preventing discrepancies or conflicts in message processing.
Blockchain technology: Blockchain technology, a prominent application of consensus algorithms, supports blockchain networks, which are essential for various use cases, including cryptocurrency and distributed ledger technologies. For instance, Bitcoin utilizes the Proof-of-Work consensus algorithm to secure its network, while Ethereum, another leading blockchain platform, is transitioning from Proof-of-Work to Proof-of-Stake to enhance scalability and energy efficiency. These consensus mechanisms are fundamental to the functionality and security of cryptocurrencies and distributed ledgers, ensuring that all participants in the network agree on the state of transactions and data.
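Returning to the distributed locking use case above, here is a minimal sketch of acquiring a lock through a consensus-backed key-value store. The `KVStore` class and its `compare_and_set` method are hypothetical stand-ins; in practice the store would be something like an etcd or ZooKeeper cluster, whose replicated state is kept consistent by a consensus protocol and whose client APIs differ from this sketch.

```python
import time
import uuid

# Hypothetical consensus-backed key-value store exposing an atomic
# compare-and-set. In a real deployment the consensus layer (e.g., Raft)
# would make this operation atomic across the whole cluster, not just
# within a local dict.

class KVStore:
    def __init__(self):
        self._data = {}

    def compare_and_set(self, key, expected, new):
        # Succeeds only if the current value matches the expected one.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False

    def delete(self, key):
        self._data.pop(key, None)


def acquire_lock(store, resource, owner, retries=3):
    """Try to take the lock; only one owner can hold it at a time."""
    for _ in range(retries):
        if store.compare_and_set(f"lock/{resource}", None, owner):
            return True
        time.sleep(0.01)  # back off and retry
    return False


store = KVStore()
me = str(uuid.uuid4())
if acquire_lock(store, "orders-db", me):
    try:
        pass  # perform the protected operation here
    finally:
        store.delete("lock/orders-db")  # release the lock
```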
Here are some practical considerations for choosing a suitable consensus algorithm:
Factors to consider (network environment, fault tolerance, performance):
2PC is best suited for environments with reliable networks and minimal fault tolerance requirements due to its vulnerability to blocking when failures occur.
3PC improves fault tolerance by adding an extra phase to avoid blocking and handling network partitions, though it introduces additional complexity and may not fully resolve all failure scenarios.
Paxos provides robust fault tolerance and consistency in environments with significant fault tolerance needs but can suffer from performance degradation in large or high-latency networks.
Raft balances strong fault tolerance and performance with a simpler leader-based approach, making it suitable for most practical applications, although it may still face challenges in very large clusters.
Implementation challenges:
2PC is relatively straightforward to implement, but if any participant fails, it can lead to complex recovery processes and blocking issues.
3PC adds complexity with its additional phase, which increases the difficulty of implementation and failure recovery.
Paxos is known for its intricate design and challenging implementation, particularly in ensuring consistency and handling network partitions.
Raft is designed to be more understandable and easier to implement than Paxos, but still requires careful management of leader elections and log replication to maintain system correctness and performance.
Complexity and resource overhead:
2PC has low complexity and resource overhead but struggles with blocking and recovery issues in the event of failures.
3PC involves higher complexity and resource usage due to its additional phase and more complex failure handling, which can be challenging to implement effectively.
Paxos is complex and incurs significant resource overhead because of its multiple communication rounds and need for consensus across nodes, impacting performance and scalability.
Raft reduces complexity and resource overhead compared to Paxos by using a leader-based model, which simplifies operations and minimizes communication rounds, though it may still face scalability challenges.
Long story short, consensus algorithms ensure reliability and consistency across distributed systems; they handle the complexities of synchronous and asynchronous communication so that agreement can be reached even when messages are delayed or nodes fail.
Leslie Lamport’s pioneering work on Paxos laid the groundwork for modern consensus mechanisms, while systems such as etcd utilize consensus algorithms to manage distributed configuration and coordination, and Diego, a core component of Cloud Foundry, relies on similar principles for its scheduling and management. John Ousterhout’s insights into distributed systems further enhance our understanding of consensus mechanisms, while permissioned blockchains leverage these algorithms to control access and maintain security.
Consensus algorithms are fundamental in computer science, especially in the domain of cloud computing, where they ensure that distributed systems can achieve agreement on a single value despite failures or network issues. These consensus mechanisms are crucial for maintaining consistency and reliability across distributed environments. However, implementing these algorithms involves a trade-off between performance and fault tolerance, as different consensus mechanisms offer various solutions to balance these needs. There are several variants of consensus algorithms that we discussed, including 2PC, 3PC, Paxos, and Raft, each designed to address specific challenges and requirements within System Design.
The following are some relevant courses that will help you to further your learning in the System Design and distributed systems domain:
Free Resources