Partitioning

Understand the data partitioning models along with their pros and cons.

Why partitioning?

Data is an asset for any organization. Growing data volumes and concurrent read/write traffic put scalability pressure on traditional databases, and as a result, latency and throughput suffer. Traditional databases are attractive because of properties such as range queries (a range query retrieves all records where some value lies between a lower and an upper boundary), secondary indices (a secondary index is a way to efficiently access records by some piece of information other than the primary key), and transactions (a transaction is a single logical unit of work that accesses and possibly modifies the contents of a database) with the ACID properties.

At some point, a single-node database is not enough to handle the load, and we might need to distribute the data over many nodes while still offering all the nice properties of relational databases. In practice, however, it has proved challenging to provide single-node database-like properties over a distributed database.

One solution is to move data to a NoSQL-like system. However, an existing codebase that is tightly coupled to a traditional database makes such a migration expensive.

Organizations might scale traditional databases by using a third-party solution, but integrating one often brings its own complexities. More importantly, there are abundant opportunities to optimize for the specific problem at hand and get much better performance than a general-purpose solution offers.

Data partitioning (or sharding) enables us to use multiple nodes where each node manages some part of the whole data. To handle increasing query rates and data amounts, we strive for balanced partitions and balanced read/write load. We discuss different ways to partition data, related challenges, and their solutions in this lesson.

Sharding

To divide load among multiple nodes, we split the data using a technique known as partitioning or sharding. In this approach, a large dataset is divided into smaller chunks that are stored on different nodes in our network.

The partitioning must be balanced, such that each partition receives about the same amount of data. If partitioning is unbalanced, the majority of queries will fall into a few partitions, and those heavily loaded partitions become a system bottleneck. Partitioning then loses much of its benefit, because a significant portion of data retrieval queries is sent to the nodes that carry the congested partitions. Such heavily loaded partitions are known as hotspots. Generally, we use the following ways to shard the data:

Vertical sharding
Horizontal sharding
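
To make the notion of balance and hotspots concrete, here is a minimal Python sketch that counts how many queries land on each partition and reports the skew. The routing rule (a CRC32 hash of the key modulo the number of partitions), the key names, and the helper functions are all assumptions made for illustration and are not prescribed by this lesson.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an arbitrary, assumed choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def load_per_partition(queried_keys, num_partitions):
    """Count how many queries land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in queried_keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew(loads):
    """Busiest partition's load divided by the average load (1.0 means perfectly balanced)."""
    average = sum(loads) / len(loads)
    return max(loads) / average if average else 0.0


# Uniform traffic: every (hypothetical) key is queried once.
uniform = [f"user-{i}" for i in range(10_000)]
# Skewed traffic: a single popular key receives half of all queries.
skewed = uniform[:5_000] + ["user-42"] * 5_000

uniform_loads = load_per_partition(uniform, 4)
skewed_loads = load_per_partition(skewed, 4)
print("uniform:", uniform_loads, "skew =", round(skew(uniform_loads), 2))
print("skewed: ", skewed_loads, "skew =", round(skew(skewed_loads), 2))
```

In the skewed run, the partition that owns the popular key receives at least half of all queries, so its load is at least twice the average. That partition is the hotspot described above.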

Vertical sharding

We can put different tables in various database instances (which might be running on different physical servers). We might also break a table into multiple tables such that some columns are in one table, while the rest are in the other. Care should be taken when queries join multiple tables: we might like to keep such tables together on one shard.

Vertical sharding is often used to speed up data retrieval from a table whose columns contain very wide text or a Binary Large Object (BLOB). In this case, the column holding the large text or BLOB is split into a different table. As shown in the following figure, the Employee table is divided into two tables: a reduced Employee table and an EmployeePicture table. The EmployeePicture table has just two columns, EmployeeID and Picture, separated from the original table. Moreover, the primary key EmployeeID of the Employee table is added to both partitioned tables. This makes reading and writing the data easier and lets the original table be reconstructed efficiently.
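
The Employee split described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the Name and Department columns and the sample row are hypothetical, and both tables live in one in-memory database for brevity, whereas a real deployment might place them on different database instances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced Employee table: narrow columns only (Name and Department are assumed).
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL,
        Department TEXT NOT NULL
    );
    -- The wide BLOB column is split out and keyed by the same EmployeeID.
    CREATE TABLE EmployeePicture (
        EmployeeID INTEGER PRIMARY KEY REFERENCES Employee(EmployeeID),
        Picture    BLOB NOT NULL
    );
""")

conn.execute("INSERT INTO Employee VALUES (?, ?, ?)", (1, "Ada", "Engineering"))
conn.execute("INSERT INTO EmployeePicture VALUES (?, ?)", (1, b"\x89PNG placeholder bytes"))

# Common queries touch only the narrow table and never read the BLOB.
names = conn.execute("SELECT EmployeeID, Name FROM Employee").fetchall()

# The original row is rebuilt by joining on the shared primary key.
full_row = conn.execute("""
    SELECT e.EmployeeID, e.Name, e.Department, p.Picture
    FROM Employee AS e
    JOIN EmployeePicture AS p ON p.EmployeeID = e.EmployeeID
    WHERE e.EmployeeID = ?
""", (1,)).fetchone()

print(names)
print(full_row)
```

If the two tables were placed on separate instances, the join would typically have to be performed in application code by looking up the same EmployeeID on each instance.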

Vertical sharding has its ...
