What is partitioning?

In the previous lesson, we discussed replication. With replication, we make multiple copies of the data, and each server contains the complete data set. But what if the data is too large to be stored on a single instance? In that case, we need to split the data and store it across different instances. This is called partitioning the data.

Partitioning alone does not solve the problem of data safety. If a node goes down, the data stored on that node is lost. So, replication is required in addition to partitioning. Although this increases the amount of data that needs to be stored, the data is not lost if a node fails.

Advantages of partitioning

  • Partitioning helps in horizontal scaling of the data. If the data is spread across different servers, then the traffic can be routed to different servers, and each server will have less load.
  • Sometimes it may not be possible to store data on a single server because of its size. In this case, using partitioning is the only option left.

When we increase the memory, disk space, and computing power of a single server, it is called vertical scaling. When we use the resources of multiple computers instead, it is called horizontal scaling. Horizontal scaling is more flexible because we can add servers when the data grows and remove them when it shrinks.

Disadvantages of partitioning

  • If a transaction involves multiple keys and the keys are stored on different servers, then the transaction is not supported.
  • Certain operations that involve multiple keys, such as the union of two sets, are not possible if the keys are stored on different instances.
  • Data handling becomes more complex when partitioning is used. Multiple RDB or AOF files are generated, one for each server, and they need to be combined to get the complete data set.
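
Whether a multi-key operation is affected depends on where its keys land. The sketch below uses a hypothetical helper, on_same_node, to check whether all the keys of an operation map to one instance. The placement rule (CRC32 of the key modulo the number of nodes) and the key names are assumptions made for illustration only.

```python
import zlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Assumed placement rule: CRC32 of the key modulo the node count."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def on_same_node(keys, num_nodes: int) -> bool:
    """Return True if every key in `keys` maps to the same node."""
    return len({node_for_key(k, num_nodes) for k in keys}) <= 1


if __name__ == "__main__":
    # Hypothetical keys for a set union across two sets.
    keys = ["set:a", "set:b"]
    if on_same_node(keys, num_nodes=3):
        print("Both sets are on one instance.")
    else:
        print("The sets are on different instances.")
```

When the helper returns False, the union cannot be computed by a single instance, which is the limitation described in the list above.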

Basic partitioning techniques

If we need to divide our keys, we should decide which key will be stored on which instance. There are two common methods that can be used to do this:

Range partitioning

This technique can be used if the keys are numeric. Let’s suppose we have employee data stored in Redis, where the employee ID is the key. If we have three nodes available, then we can define that all the keys from 0 to 9,999 will be stored on Node 1, keys from 10,000 to 19,999 will be stored on Node 2, and so on.

If the key is a string, we can divide the keys based on the first character. So, all the keys starting with A through H can be assigned to the first node, keys starting with I through P can be assigned to the second node, and so on.
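
The two range schemes above can be sketched in Python as follows. This is a minimal illustration, not Redis functionality. The function names, the range width of 10,000, the decision to place a boundary ID such as 10,000 on the higher node, and the letter range for the third node (Q to Z) are all assumptions made for the example.

```python
RANGE_WIDTH = 10_000  # assumed width of each ID range
NUM_NODES = 3


def node_for_employee_id(employee_id: int) -> int:
    """Return the 1-based node number for a numeric employee ID."""
    if employee_id < 0:
        raise ValueError("employee ID must be non-negative")
    node = employee_id // RANGE_WIDTH + 1
    if node > NUM_NODES:
        raise ValueError(f"no node is configured for ID {employee_id}")
    return node


# Assumed letter ranges: A-H -> Node 1, I-P -> Node 2, Q-Z -> Node 3.
LETTER_RANGES = [("A", "H", 1), ("I", "P", 2), ("Q", "Z", 3)]


def node_for_name(key: str) -> int:
    """Return the 1-based node number based on the first character of the key."""
    first = key[:1].upper()
    for low, high, node in LETTER_RANGES:
        if low <= first <= high:
            return node
    raise ValueError(f"no node is configured for key {key!r}")


if __name__ == "__main__":
    print(node_for_employee_id(4_200))   # 1
    print(node_for_employee_id(10_000))  # 2
    print(node_for_name("alice"))        # 1
    print(node_for_name("Maria"))        # 2
```

Keys that fall outside every configured range raise an error here. A real deployment would need an explicit rule for such keys.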

There are a few disadvantages of using range partitioning. It may lead to an uneven distribution of keys. Let’s say about 70 percent of the keys start with A. Node 1 will then have most of the load, and the other nodes will be relatively empty. This may defeat the purpose of partitioning. Another issue is that if one or more instances are lost, then the data stored on those instances will also be lost. If new instances are added, then we will have to redefine the ranges that are assigned to each node.
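
To make the skew concrete, the sketch below counts keys per node under first-character range partitioning. The key set is made up for illustration (70 keys that start with A and 30 keys that start with K, R, or W), and the letter range for the third node (Q to Z) is an assumption.

```python
from collections import Counter


def node_for_name(key: str) -> int:
    """First-character ranges: A-H -> 1, I-P -> 2, Q-Z -> 3 (Q-Z is assumed)."""
    first = key[:1].upper()
    if "A" <= first <= "H":
        return 1
    if "I" <= first <= "P":
        return 2
    if "Q" <= first <= "Z":
        return 3
    raise ValueError(f"no node is configured for key {key!r}")


def keys_per_node(keys):
    """Count how many keys land on each node."""
    return Counter(node_for_name(k) for k in keys)


# Hypothetical data set: 70 percent of the keys start with "A".
keys = [f"A{i}" for i in range(70)]
keys += [f"{letter}{i}" for letter in "KRW" for i in range(10)]

counts = keys_per_node(keys)
for node in (1, 2, 3):
    print(f"Node {node}: {counts[node]} keys")
# Node 1: 70 keys
# Node 2: 10 keys
# Node 3: 20 keys
```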

Hash partitioning

If the key is not a number, we can use a hash function to convert it into a number. We can divide this number by the number of nodes available, and the node can be decided based on the remainder. For example, if the hash of a key is 658 and we have 3 nodes, then 658 leaves 1 as the remainder when divided by 3. We can then store this key on Node 1.
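
A minimal sketch of this scheme in Python follows. The text does not prescribe a hash function, so the use of CRC32 (through zlib.crc32 from the standard library) and the function names are assumptions for illustration. The built-in hash() is avoided because Python randomizes string hashes per process by default, which would send the same key to different nodes on different runs.

```python
import zlib


def node_for_hash(hash_value: int, num_nodes: int) -> int:
    """Pick a node from the remainder of hash_value divided by num_nodes."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return hash_value % num_nodes


def node_for_key(key: str, num_nodes: int) -> int:
    """Hash a string key to a number (CRC32, an assumed choice), then take the remainder."""
    return node_for_hash(zlib.crc32(key.encode("utf-8")), num_nodes)


if __name__ == "__main__":
    print(node_for_hash(658, 3))  # 1, matching the example above
    print(node_for_key("employee:1001", 3))  # a value in 0, 1, or 2
```

The remainder is always between 0 and num_nodes - 1, so a deployment has to decide how those values map to its node names.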

In this method of partitioning, the keys are distributed roughly evenly, provided the hash function spreads its values well. A prime number of nodes is often recommended for the most effective partitioning. If the number of nodes is even and the hash values follow a pattern (for example, most of them are even), more keys are likely to collide on the same nodes. If the number of nodes is constantly changing, it will impact the performance because the remainder changes for many keys, and some data will be dumped upon the removal of a node. A technique called presharding is used to avoid this.
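
The cost of changing the node count can be estimated with a small experiment. The sketch below hashes a made-up set of 10,000 keys with CRC32 (an assumed hash function) and counts how many keys map to a different node when a fourth node is added to three. With a uniformly distributed hash, a key keeps its node only when its hash leaves the same remainder for 3 and for 4, which holds for 3 out of every 12 consecutive hash values, so roughly three quarters of the keys are expected to move.

```python
import zlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Assumed placement rule: CRC32 of the key modulo the node count."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose node changes when the node count changes."""
    moved = sum(
        1 for k in keys
        if node_for_key(k, old_nodes) != node_for_key(k, new_nodes)
    )
    return moved / len(keys)


# Hypothetical key set for the experiment.
keys = [f"employee:{i}" for i in range(10_000)]
print(f"{fraction_moved(keys, 3, 4):.0%} of keys map to a different node")
```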
