Partitioning Schemes

The goal of partitioning data is to distribute data evenly across all the nodes in the system. Consider a system with 5 nodes that has a partition of data living on each of the 5 nodes and the data is evenly distributed.

widget

Theoretically, this system can handle 5 times more data and can achieve 5 times more read and write throughput than a system with a single node. We could also horizontally scale the system infinitely by simply adding more nodes and distributing the data evenly among the total nodes. However, this theoretical performance gain is usually not achievable in practice and scaling by adding new nodes brings its own challenges that we’ll cover later.

widget

How to partition data

The elephant in the room we have ignored so far is how do we actually go about partitioning data? To make the example concrete, let’s say we are trying to partition a collection of millions of song files in a database with five nodes. Some of the ways we can partition are:

Partition Randomly

Randomly assign a node to each song but storing the same number of songs on each node (assuming the total number of songs is exactly divisible by 5 for simplicity). The acute reader can immediately see the problem with this approach. We can store the file on a node but come retrieval time we have no way of knowing which node stored the file on. Partitioning is a two way street, we should be able to locate the node for storage and retrieval of a particular record deterministically and relatively quickly. Deterministically, means the destination node for a record is always worked out to be the same. Quickly is subjective but in general we can’t afford to run, say , a bunch of complex time-consuming mathematical formulas/equations to determine the destination node for a record.

widget

Partition by Range

Our second attempt at partitioning can be ...

Access this course and 1400+ top-rated courses and projects.