Vertical Sharding of Database

Often we have slow-moving data in distributed databases that needs to be separated from the other data. Similarly, some sensitive data, such as passwords, may need extra security controls. Vertical sharding helps in both cases, either by separating the desired data from the rest or by enabling extra security controls over a subset of the data.

Vertical sharding

We can put different tables in different database instances (which might be running on different physical servers). We might also break a table into multiple tables so that some columns are in one table while the rest are in another. Care should be taken when there are joins between multiple tables; we might want to keep such tables together on one shard.
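The placement idea above can be sketched as a small lookup table. This is only an illustration: the table names, the shard names, and the helper functions `shard_for` and `is_local_join` are hypothetical and do not come from any particular database product.

```python
# Hypothetical placement map: table name -> shard (database instance) name.
# Tables that are frequently joined are deliberately placed on the same shard.
TABLE_TO_SHARD = {
    "Employee": "shard_1",
    "Department": "shard_1",  # often joined with Employee, so co-located
    "AuditLog": "shard_2",    # rarely joined with the others
}


def shard_for(table: str) -> str:
    """Return the shard that stores the given table."""
    return TABLE_TO_SHARD[table]


def is_local_join(left: str, right: str) -> bool:
    """A join is local (no cross-shard traffic) when both tables share a shard."""
    return shard_for(left) == shard_for(right)
```

With this map, a join between Employee and Department stays on one shard, while a join between Employee and AuditLog would have to cross shards.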

Vertical sharding is often used to speed up data retrieval from a table that has columns containing very wide text or a Binary Large Object (BLOB). In this case, the column holding the large text or BLOB is split off into a different table. As shown in the following figure, the Employee table is divided into two tables: a reduced Employee table and an EmployeePicture table. The EmployeePicture table has just two columns, EmployeeID and Picture, separated from the original table. Moreover, the primary key EmployeeID of the Employee table is added to both partitioned tables. This makes reading and writing the data easier, and the original table can be reconstructed efficiently.
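The Employee split described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions: the `Name` column and the sample row are made up, and both tables live in one in-memory database for convenience, whereas in a sharded deployment they could sit on different instances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two vertically partitioned tables that share the primary key EmployeeID.
conn.executescript("""
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,
        Name       TEXT              -- hypothetical narrow column
    );
    CREATE TABLE EmployeePicture (
        EmployeeID INTEGER PRIMARY KEY,
        Picture    BLOB              -- the wide column, split off
    );
""")
conn.execute("INSERT INTO Employee VALUES (?, ?)", (1, "Alice"))
conn.execute("INSERT INTO EmployeePicture VALUES (?, ?)", (1, b"\x89PNG"))

# Common reads touch only the narrow table and never load the BLOB.
name = conn.execute(
    "SELECT Name FROM Employee WHERE EmployeeID = ?", (1,)
).fetchone()[0]

# The original row is reconstructed by joining on the shared key.
full_row = conn.execute(
    """
    SELECT e.EmployeeID, e.Name, p.Picture
    FROM Employee AS e
    JOIN EmployeePicture AS p ON p.EmployeeID = e.EmployeeID
    WHERE e.EmployeeID = ?
    """,
    (1,),
).fetchone()
```

Because EmployeeID appears in both tables, the join condition is a simple key equality, which is what keeps the reconstruction cheap.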

A well-partitioned database architecture distributes data and workload evenly across all the distributed nodes. To maintain a consistent level of performance, queries must be directed accurately to the desired node and the miss rate must be minimized. The number of distributed database nodes should also be kept in balance: a large number of nodes would unnecessarily consume database engine resources and produce contention spots even on large-scale hardware. Therefore, there should be a balance between optimizing query performance and consolidating nodes, so that resources are used in a way that reduces cost (latency and resource consumption).

Distributed JOIN operations

In relational algebra, as well as in RDBMSs, join operations are considered among the most expensive operations. When joining two relations (or tables), say A and B, every tuple in A is conceptually compared with every tuple in B to check whether the join condition is satisfied. If it is, the rows are concatenated and copied into the result relation. A distributed join takes two or more partitioned tables and performs the join on them using one of the two strategies explained below.
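The tuple-by-tuple comparison described above corresponds to a nested-loop join. The sketch below is a simplified illustration only: rows are modeled as Python dictionaries, the join condition is assumed to be equality on a single attribute, and the function name `nested_loop_join` and the sample data are made up.

```python
def nested_loop_join(a, b, attr):
    """Join two lists of dict rows on equality of `attr`.

    Every row of `a` is compared with every row of `b`, so the number of
    comparisons is len(a) * len(b).
    """
    result = []
    for row_a in a:
        for row_b in b:
            if row_a[attr] == row_b[attr]:
                # Concatenate the matching rows into one result row.
                result.append({**row_a, **row_b})
    return result


A = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
B = [{"id": 1, "dept": "HR"}, {"id": 3, "dept": "IT"}]
joined = nested_loop_join(A, B, "id")
```

Here four comparisons are made for two rows on each side, which hints at why joins become expensive as tables grow. Real database engines typically use more efficient join algorithms where they can.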

There are two basic join strategies for partitioned (distributed) tables: simple join and semi join. Simple join tries to directly optimize the ordering of join operations, while semi join is an alternative approach that tries to minimize the communication costs during join operations.
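To make the communication-cost difference concrete, here is a rough sketch that counts how many attribute values cross the network under each strategy. It rests on several assumptions not stated in the text: the simple join is modeled as shipping all of relation A to node B, the semi join as shipping only A's distinct join values to node B and shipping B's matching rows back, and cost is measured as a plain count of values. The function names and sample data are hypothetical.

```python
def simple_join(a_rows, b_rows, attr):
    """Ship every row of A to node B and join there.

    Returns (result, number of attribute values shipped).
    """
    shipped = sum(len(row) for row in a_rows)
    result = [{**a, **b} for a in a_rows for b in b_rows if a[attr] == b[attr]]
    return result, shipped


def semi_join(a_rows, b_rows, attr):
    """Ship only A's join values to node B, then ship B's matches back to A.

    Returns (result, number of attribute values shipped).
    """
    keys = {a[attr] for a in a_rows}                  # sent from A to B
    b_match = [b for b in b_rows if b[attr] in keys]  # sent from B back to A
    shipped = len(keys) + sum(len(b) for b in b_match)
    result = [{**a, **b} for a in a_rows for b in b_match if a[attr] == b[attr]]
    return result, shipped


# Four wide rows at node A, three narrow rows at node B, one matching key.
A = [{"id": i, "name": f"emp{i}", "bio": "x"} for i in range(1, 5)]
B = [{"id": 1, "dept": "HR"}, {"id": 8, "dept": "IT"}, {"id": 9, "dept": "Ops"}]

simple_result, simple_cost = simple_join(A, B, "id")
semi_result, semi_cost = semi_join(A, B, "id")
```

Both strategies produce the same joined rows, but in this example the semi join moves fewer values because few rows of B match. When most rows match, the extra round trip may not pay off.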

Simple join operation

Simple join operations minimize the query processing cost (execution time, the number of disk accesses, and the communication costs across different nodes) by optimizing the ordering of join operations on relations. As an example, we assume two relations A and B that are stored on two nodes, say node A and node B. Moreover, it is assumed that relations are transferred among the nodes of the distributed system on a set-at-a-time basis (a set of tuples is transferred at once if the table size is very large) instead of a tuple-at-a-time basis (tuples are transferred one by one instead of as a set).

Let us consider a simple query that joins two relations A and B, stored at nodes A and B, respectively, based on some attribute (Attr) or condition. In performing $A \bowtie_{Attr} B$ ...
