Vertical Sharding of Database

Distributed databases often hold slow-moving data that needs to be separated from the rest of the data. Similarly, some sensitive data, such as passwords, needs extra security controls. Vertical sharding helps in both cases: it separates the desired data from the rest, which also makes it possible to apply extra security controls to just that data.

Vertical sharding

We can put different tables in different database instances, which might run on different physical servers. We might also break a table into multiple tables so that some columns are in one table and the rest are in another. Care should be taken when queries join multiple tables: we might like to keep such tables together on one shard so that the join does not have to cross servers.

Vertical sharding is often used to speed up data retrieval from a table that has columns containing very wide text or a Binary Large Object (BLOB). In this case, the column holding the large text or BLOB is split into a different table. As shown in the following figure, the Employee table is divided into two tables: a reduced Employee table and an EmployeePicture table. The EmployeePicture table has just two columns, EmployeeID and Picture, separated from the original table. Moreover, the primary key EmployeeID of the Employee table is present in both partitioned tables. This makes reads and writes easier and lets the original table be reconstructed efficiently.
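The Employee split described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: two in-memory databases stand in for two separate database instances, and the Name and Department columns and all function names are assumptions made for the example. Only EmployeeID and Picture come from the text.

```python
import sqlite3

# Two in-memory databases stand in for two separate database instances.
core_db = sqlite3.connect(":memory:")
blob_db = sqlite3.connect(":memory:")

# Reduced Employee table: narrow columns only (Name, Department are assumed).
core_db.execute(
    "CREATE TABLE Employee ("
    "EmployeeID INTEGER PRIMARY KEY, Name TEXT, Department TEXT)"
)
# EmployeePicture table: the wide BLOB column plus the shared primary key.
blob_db.execute(
    "CREATE TABLE EmployeePicture ("
    "EmployeeID INTEGER PRIMARY KEY, Picture BLOB)"
)


def insert_employee(employee_id, name, department, picture):
    """Write one logical employee row into both vertical partitions."""
    core_db.execute(
        "INSERT INTO Employee VALUES (?, ?, ?)",
        (employee_id, name, department),
    )
    blob_db.execute(
        "INSERT INTO EmployeePicture VALUES (?, ?)",
        (employee_id, picture),
    )


def get_employee_summary(employee_id):
    """Common read path: touches only the narrow table, never the BLOB."""
    return core_db.execute(
        "SELECT Name, Department FROM Employee WHERE EmployeeID = ?",
        (employee_id,),
    ).fetchone()


def get_full_employee(employee_id):
    """Reconstruct the original row by matching EmployeeID in both tables."""
    summary = get_employee_summary(employee_id)
    if summary is None:
        return None
    row = blob_db.execute(
        "SELECT Picture FROM EmployeePicture WHERE EmployeeID = ?",
        (employee_id,),
    ).fetchone()
    picture = row[0] if row else None
    return (employee_id, summary[0], summary[1], picture)


insert_employee(1, "Alice", "Engineering", b"fake-image-bytes")
print(get_employee_summary(1))
print(get_full_employee(1))
```

Because both tables share EmployeeID, the frequent summary query never reads the picture data, while the full row can still be rebuilt with a single key lookup in each partition.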

A well-partitioned database architecture distributes data and workload evenly across all the distributed nodes. To keep performance consistent, queries must be directed accurately to the node that holds the data, and the miss rate must be minimized. The number of database nodes also needs to be balanced. A large number of nodes would unnecessarily consume database engine resources and produce contention spots even on large-scale hardware. Therefore, there should be a balance between optimizing query performance and consolidating nodes, so that resources are used in a way that reduces cost (latency and resource consumption).

Distributed JOIN operations

In relational algebra, as well as in RDBMSs, join operations are considered the most expensive operations. When joining two relations (or tables), say A and B, every tuple in A is compared with every tuple in B to check whether the join condition is satisfied. If it is, the two rows are concatenated and copied into the result relation. A distributed join takes two or more partitioned tables and performs the join on them using one of the two strategies explained below.
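The tuple-by-tuple comparison described above corresponds to a nested-loop join. The following is a minimal single-machine sketch; the function name, the sample rows, and the representation of rows as Python tuples are assumptions made for illustration.

```python
def nested_loop_join(table_a, table_b, condition):
    """Compare every tuple in A with every tuple in B.

    Rows that satisfy the join condition are concatenated and
    copied into the result relation.
    """
    result = []
    for row_a in table_a:
        for row_b in table_b:
            if condition(row_a, row_b):
                result.append(row_a + row_b)
    return result


# Hypothetical sample relations: (id, name) and (id, department).
table_a = [(1, "Alice"), (2, "Bob")]
table_b = [(1, "Engineering"), (3, "Sales")]

# Join on the first attribute of each relation.
joined = nested_loop_join(table_a, table_b, lambda a, b: a[0] == b[0])
print(joined)  # [(1, 'Alice', 1, 'Engineering')]
```

The number of comparisons is the product of the two table sizes, which is why joins are costly even before any network transfer is involved.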

There are two basic join strategies for partitioned (distributed) tables: simple join and semi join. Simple join tries to directly optimize the ordering of join operations, while semi join is an alternative approach that tries to minimize the communication costs during join operations.
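To give a rough idea of how a semi join can reduce communication, here is a sketch of the commonly described three-step form: ship only the join-key values from one node, filter the other relation with them, and ship back only the matching rows. The node layout is simulated with plain lists, and the function name, parameters, and sample data are all assumptions for illustration, not the lesson's own implementation.

```python
def semi_join(node_a_rows, node_b_rows, key_a, key_b):
    """Simulate a semi join between relation A (node A) and B (node B).

    Returns the joined rows plus the number of key values sent to
    node B and the number of rows sent back to node A.
    """
    # Step 1 (at node A): project the distinct join-key values of A.
    keys = {row[key_a] for row in node_a_rows}

    # Step 2 (at node B): keep only the rows of B whose key appears in A.
    reduced_b = [row for row in node_b_rows if row[key_b] in keys]

    # Step 3 (at node A): join A with the reduced B shipped back.
    b_by_key = {}
    for row in reduced_b:
        b_by_key.setdefault(row[key_b], []).append(row)
    joined = [
        row_a + row_b
        for row_a in node_a_rows
        for row_b in b_by_key.get(row_a[key_a], [])
    ]
    return joined, len(keys), len(reduced_b)


# Hypothetical sample relations stored on two different nodes.
node_a = [(1, "Alice"), (2, "Bob")]
node_b = [(1, "Engineering"), (3, "Sales"), (4, "Operations")]

rows, keys_sent, rows_returned = semi_join(node_a, node_b, key_a=0, key_b=0)
print(rows)           # [(1, 'Alice', 1, 'Engineering')]
print(keys_sent)      # 2
print(rows_returned)  # 1
```

In this example only two key values and one row cross the network, instead of all three rows of B. The saving depends on how selective the join is; if most rows of B match, the extra round trip may not pay off.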

Simple join operation

Simple join operations minimize the query processing cost (execution time, the number of disk accesses, and the communication costs across different nodes) by optimizing the ordering of the join operations on relations. As an example, we assume two relations A and B that are stored on two nodes, say node A and node B. Moreover, we assume that relations are transferred among the nodes of the distributed system on a set-at-a-time basis (a set of tuples is transferred at once, which suits very large tables) instead of a tuple-at-a-time basis (tuples are transferred one by one).
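The difference between the two transfer modes can be illustrated by counting how many transfers each one needs. This is a minimal sketch: the helper name, the batch size of 1000, and the idea of counting one network message per batch are assumptions for illustration.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, each holding at most batch_size tuples."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical relation of 10,000 single-attribute tuples.
relation = [(i,) for i in range(10_000)]

# Tuple-at-a-time: one transfer per tuple.
tuple_at_a_time_messages = sum(1 for _ in batches(relation, 1))

# Set-at-a-time: one transfer per set of tuples (assumed batch size 1000).
set_at_a_time_messages = sum(1 for _ in batches(relation, 1000))

print(tuple_at_a_time_messages)  # 10000
print(set_at_a_time_messages)    # 10
```

Fewer, larger transfers amortize the per-message overhead, which is why the set-at-a-time assumption is convenient when reasoning about communication cost.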

Let us consider a simple query that joins two relations A and B, stored at nodes A and B, respectively, on some attribute (Attr) or condition. In performing $A\Join_{Attr} B$ ...
