Sharding and partitioning are both popular terms in the world of databases and distributed systems. Interestingly, these terms frequently confuse learners due to their close proximity, and that they are often used interchangeably. In this Answer, we'll learn what these terms mean and how they differ.
In database systems, partitioning refers to splitting data based on a predefined set of attributes. Think of partitioning as a generic term for all data-splitting methods in a database. The purpose of partitioning is to increase the performance of database queries while ensuring the maintainability of the data.
While partitioning is a generic term for data splitting in a database, sharding is used for a specific type of partitioning, popularly known as horizontal partitioning.
In sharding, data is split horizontally into multiple shards. The schema of the table is replicated in every shard, and a unique portion of the whole table lives in each of them. A key is predefined to determine which shard a piece of data lives. The database checks the key whenever a new row comes and writes the data to a specific shard.
Consider an example of a database table with a created_at
column. The table gets populated every day and is growing. We have decided to split the table horizontally into multiple shards based on the created_at
column to prevent the table from becoming too big.
One strategy could be having a shard for each month. All the rows for a specific month will be persisted on the shard for that particular month.
This raises a question if there is horizontal partitioning, is there also vertical partitioning? The answer is yes!
In vertical partitioning, a table is split vertically. In the example above, think of splitting the table into two tables where the first two columns live in one table, and the last one lives in another. Both of the partitions have a shared key which is used to bridge the connection between related data, but the two vertical partitions do not have the same schema.
In short, partitioning is a generic term used for data splitting in a database system. Sharding is partitioning where data is split horizontally, and the schema is replicated in all the shards.