Partitioning Techniques

Learn about different partitioning techniques for a distributed system.

We'll cover the following...

Range-based partitioning
- Pros
- Cons
Hash partitioning
- Pros
- Cons
Key takeaways

How we partition data in a distributed system matters significantly, so we must choose a partitioning strategy that fits the data we have. Otherwise, the system becomes harder to operate and scale in the long run.

A good partitioning technique distributes data evenly among the nodes. If we have $n$ nodes and scale out to $2n$ nodes, a good partitioning strategy lets us gain close to twice the performance from our system.

Before discussing partitioning techniques, let’s assume that each row of our data has a key that identifies it uniquely. Most databases make a similar assumption.

Let’s discuss a few techniques.

Range-based partitioning

In range-based partitioning, data is partitioned based on ranges of the key. For example, say our keys are strings. Data with keys starting with ‘a’ to ‘j’ is stored in node 1, ‘k’ to ‘p’ in node 2, and ‘q’ to ‘z’ in node 3.
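The lookup in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `UPPER_BOUNDS`, `NODES`, and `node_for_key` are hypothetical, and keys are assumed to start with an ASCII letter.

```python
from bisect import bisect_left

# Hypothetical layout mirroring the example above:
# the last first-letter that each node is responsible for.
UPPER_BOUNDS = ["j", "p", "z"]
NODES = ["node 1", "node 2", "node 3"]


def node_for_key(key: str) -> str:
    """Return the node whose range covers the first letter of the key."""
    first = key[0].lower()
    if not "a" <= first <= "z":
        raise ValueError(f"key must start with a letter: {key!r}")
    # bisect_left finds the first range whose upper bound is >= the letter.
    return NODES[bisect_left(UPPER_BOUNDS, first)]


print(node_for_key("apple"))   # node 1
print(node_for_key("kafka"))   # node 2
print(node_for_key("redis"))   # node 3
```

Storing only the upper bound of each range keeps the lookup a binary search over the boundaries, so it stays cheap even with many partitions.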


This technique is very simple. However, in many cases range-based partitioning stores a large volume of data in a few nodes and a low volume in the others. In the above example, the partition ‘k’ to ‘p’ could be larger than the other partitions even though its key range is smaller, in which case node 2 would hold more data than the other nodes.

Note: Poor partitioning can result in data skew in the system. Partitions are skewed when one or a few partitions have significantly more data than the others.

Skewed partitions can create hotspots, where some nodes receive far more read and write requests than others.
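One rough way to put a number on skew is to compare the largest partition with the average partition size. The sketch below assumes the ‘a’ to ‘j’, ‘k’ to ‘p’, ‘q’ to ‘z’ layout from the example; the sample keys and the names `partition_of` and `skew_ratio` are made up for illustration.

```python
from collections import Counter

NODES = ("node 1", "node 2", "node 3")


def partition_of(key: str) -> str:
    """Map a key to a node using the a-j / k-p / q-z ranges."""
    first = key[0].lower()
    if first <= "j":
        return "node 1"
    if first <= "p":
        return "node 2"
    return "node 3"


def skew_ratio(keys) -> float:
    """Largest partition size divided by the mean partition size.

    A value near 1.0 means an even spread; larger values mean more skew.
    """
    counts = Counter(partition_of(k) for k in keys)
    sizes = [counts.get(node, 0) for node in NODES]  # count empty nodes too
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical data set in which most keys happen to start with k..p.
keys = ["apache", "kafka", "kotlin", "linux", "mongo", "mysql",
        "nginx", "node", "oracle", "postgres", "python", "redis"]
print(skew_ratio(keys))  # 2.5: node 2 holds 10 of the 12 keys
```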

Pros

Facilitates range queries. Since a node contains a continuous range of keys, the keys can be kept sorted, so a range query on the key can be answered quickly.
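To see why sorted keys help, here is a minimal sketch of a range query on a single node, assuming the node keeps its keys in a sorted list. The function name `range_query` and the sample keys are hypothetical.

```python
from bisect import bisect_left, bisect_right


def range_query(sorted_keys, low, high):
    """Return all keys k with low <= k <= high from a sorted list."""
    start = bisect_left(sorted_keys, low)    # first key >= low
    end = bisect_right(sorted_keys, high)    # one past the last key <= high
    return sorted_keys[start:end]


# Keys held by one node, kept in sorted order.
node_1_keys = sorted(["grape", "apple", "fig", "cherry", "banana", "date"])
print(range_query(node_1_keys, "b", "e"))  # ['banana', 'cherry', 'date']
```

Both ends of the range are found by binary search, so the node only touches the keys it actually returns.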

Cons

Improper selection of range boundaries can easily lead to hotspots.
If one or a few nodes become hotspots, it might be costly to rebalance the data across the nodes.

Moving data among the nodes to distribute it evenly across the system is called rebalancing. The performance of a partitioning strategy is largely determined by how much rebalancing is required when a node becomes a hotspot or when a node is added to or removed from the system.
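The cost of rebalancing can be estimated by counting how many keys change nodes when the range boundaries change. The sketch below assumes a hypothetical fix for a hot node 2: its ‘k’ to ‘p’ range is split and the upper half is handed to a new node 4. All names and sample keys are illustrative.

```python
from bisect import bisect_left


def assign(keys, upper_bounds, nodes):
    """Map each key to the node whose range covers its first letter."""
    return {k: nodes[bisect_left(upper_bounds, k[0].lower())] for k in keys}


def moved_keys(keys, old_layout, new_layout):
    """Return the keys that live on a different node after the change."""
    old = assign(keys, *old_layout)
    new = assign(keys, *new_layout)
    return [k for k in keys if old[k] != new[k]]


# Each layout is (upper bounds, node names).
old_layout = (["j", "p", "z"], ["node 1", "node 2", "node 3"])
# Split k..p into k..m (stays on node 2) and n..p (moves to node 4).
new_layout = (["j", "m", "p", "z"], ["node 1", "node 2", "node 4", "node 3"])

keys = ["apple", "kafka", "linux", "mongo", "nginx", "oracle", "python", "redis"]
print(moved_keys(keys, old_layout, new_layout))  # ['nginx', 'oracle', 'python']
```

In this sketch only the keys in the split range move; keys on node 1 and node 3 stay where they are.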

Hash partitioning

One easy way of avoiding skewness in partitions is hash partitioning. In this technique, the key is hashed using some hash function. A good function ensures the keyspace gets evenly distributed in a ...
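As a minimal sketch of the idea, the snippet below hashes each key and maps the hash to a node. The text here only states that the key is hashed, so the choice of SHA-256 and the use of a modulo to pick the node are assumptions made for illustration, and the name `node_for_key` is hypothetical.

```python
import hashlib
from collections import Counter


def node_for_key(key: str, num_nodes: int) -> int:
    """Return a node index in [0, num_nodes) derived from a hash of the key."""
    # hashlib gives the same result on every run, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Assumption: take the first 8 bytes as an integer and reduce it modulo
    # the number of nodes.
    return int.from_bytes(digest[:8], "big") % num_nodes


# Keys that share a prefix would all land in one range partition,
# but their hashes spread them across the nodes.
counts = Counter(node_for_key(f"key-{i}", 3) for i in range(3000))
print(sorted(counts.items()))
```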
