Replication and CQL DDL: CREATE KEYSPACE statement

Explore the concept of replication in Apache Cassandra, which is crucial for achieving high availability and fault tolerance in a distributed database system.

To ensure data is available in face of node failure, Apache Cassandra allows multiple copies (replicas) of data to be stored on different nodes. All replicas are equal and no replica is treated as primary/master.

Replication factor (RF)

The replication factor (RF) defines the number of replicas across the cluster. A replication factor of 1 specifies that a partition is stored only once in the cluster, 2 specifies that the partition is saved on two different nodes, etc. The replication factor is defined for each datacenter and is generally set to greater than one and less than the number of nodes. An RF of 3 is preferred as it provides a balance between availability, consistency, and performance.

Replication strategy

A replication strategy determines the nodes where the replicas are placed. Cassandra supports the following two replication strategies:

  1. The NetworkTopologyStrategy is selected for most deployments as it is datacenter and rack aware and enables future expansion to multiple datacenters. 

  2. The SimpleStrategy is used for a single datacenter with only one rack. 

Let’s consider a cluster with two datacenters. Datacenter-A has a replication factor of 2, while Datacenter-B has a replication factor of 3.

Any node can handle any write request. Let’s assume that the write request is received in Datacenter-A, the node receiving the request becomes the coordinator and forwards the data to the two nodes responsible for the token range in this datacenter. The coordinator node then asynchronously sends the write request to Datacenter-B, and data is written to the three nodes responsible for the token range in Datacenter-B. This data is maintained in both datacenters, as demonstrated in the diagram below.

Get hands-on with 1300+ tech skills courses.