Cost Benefit Analyses

Learn when to use horizontal sharding instead of vertical sharding and vice versa.

Which database sharding approach is best?

Both horizontal and vertical sharding involve adding resources to our computing infrastructure, and business stakeholders must decide which approach suits the organization. To support growth, prevent downtime, and reduce latency, we must scale our resources accordingly. We can do this through a combination of adjustments to CPU, physical memory, disk storage, and network bandwidth.

The following sections explain the pros and cons of no-sharding vs. sharding.

Pros and cons of a centralized database

Advantages

  • Data maintenance, such as updating and backing up a centralized database, is easy.

  • Centralized databases provide strong consistency and ACID transactions more easily than distributed databases.

  • Centralized databases offer programmers a much simpler programming model than distributed databases.

  • A centralized database is more efficient for businesses whose data is small enough to reside on a single node.

Disadvantages

  • A centralized database can slow down, which means higher latency for end users, when the number of queries per second approaches the limits of a single node.

  • A centralized database is a single point of failure, so the probability of it becoming inaccessible is much higher.

Advantages and disadvantages of a distributed database

Advantages

  • Accessing data in a distributed database can be fast because data is retrieved from the nearest shard or from the most frequently used one.

  • Data with different levels of distribution transparency can be stored in separate places.

  • Intensive transactions consisting of queries can be divided into multiple optimized subqueries which can be processed in a parallel fashion.
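The last point, splitting a query into subqueries that run in parallel, can be sketched as follows. This is a minimal illustration, not a real database client: the in-memory `SHARDS` list, the `(store_key, amount)` row layout, and both function names are hypothetical stand-ins for separate database nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for one database node holding
# a horizontal slice of a sales table as (store_key, amount) rows.
SHARDS = [
    [(1, 10.0), (2, 5.0)],
    [(1, 7.5), (3, 2.5)],
    [(2, 1.0), (3, 4.0)],
]

def shard_subquery(rows, store_key):
    """Subquery run on one shard: partial sum of amounts for one store."""
    return sum(amount for key, amount in rows if key == store_key)

def total_for_store(store_key):
    """Send the subquery to every shard in parallel, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda rows: shard_subquery(rows, store_key), SHARDS)
        return sum(partials)

print(total_for_store(1))
```

Each shard only scans its own rows, and the coordinator merges small partial results instead of pulling every row to one place.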

Disadvantages

  • Some queries need data from multiple sites, which adds network round trips and increases response time.

  • Because relations are partitioned (vertically or horizontally) among different nodes, operations such as joins must reconstruct complete relations by carefully fetching data from several nodes, which can make them much more expensive and complex.

  • Maintaining consistency of data across sites is difficult and requires extra measures.

  • Updates and backups in a distributed database take time because data must be synchronized across sites.
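To make the reconstruction cost mentioned above more concrete, here is a small sketch of rebuilding a complete relation from its partitions. The customer data, the shard variables, and the two helper functions are all hypothetical; real systems do this over the network, which is where the extra cost comes from.

```python
# Hypothetical customer relation, split two different ways.
# Horizontal: each shard holds whole rows for a subset of keys.
h_shard_1 = {1: {"name": "Ann", "city": "Oslo"}}
h_shard_2 = {2: {"name": "Bo", "city": "Rome"}}

# Vertical: each shard holds a subset of columns for every key.
v_shard_names = {1: {"name": "Ann"}, 2: {"name": "Bo"}}
v_shard_cities = {1: {"city": "Oslo"}, 2: {"city": "Rome"}}

def rebuild_horizontal(*shards):
    """Union of rows: every shard contributes complete rows."""
    full = {}
    for shard in shards:
        full.update(shard)
    return full

def rebuild_vertical(*shards):
    """Join on the key: every shard contributes some columns of each row."""
    full = {}
    for shard in shards:
        for key, columns in shard.items():
            full.setdefault(key, {}).update(columns)
    return full

from_rows = rebuild_horizontal(h_shard_1, h_shard_2)
from_columns = rebuild_vertical(v_shard_names, v_shard_cities)
print(from_rows == from_columns)
```

Either way, every shard involved has to be contacted before the full relation is available.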

Query optimization and processing speed in a distributed database

The cost of processing a transaction in a distributed database depends on the type of query, the number of sites (shards) involved, the communication speed, and other factors such as the underlying hardware and the type of database used. As an example, assume a query that accesses three tables, Store, Product, and Sales, which are spread across different sites.

The number of attributes in each table is given in the following figure.

Assume the three tables are distributed across sites as follows:

  • The Store table has 10,000 tuples stored at site A.
  • The Product table has 100,000 tuples stored at site B.
  • The Sales table has 1 million tuples stored at site A.

Now assume that the following query needs to be processed.

Select Store_key from (Store JOIN Sales JOIN Product)
where Region= 'East' AND Brand='Wolf';

The above query joins the Store, Sales, and Product tables, keeps only the rows where the region is East and the brand is Wolf, and retrieves the Store_key values from the result.
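To see what the query returns logically, the following sketch runs an equivalent query on a single in-memory SQLite database. The column names other than Store_key, Region, and Brand, the join keys, and the sample rows are assumptions made for illustration, since the original schema is not reproduced here. It shows the expected result only, not how the work is distributed across sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; Product_key and the join keys are assumed.
conn.executescript("""
CREATE TABLE Store   (Store_key INTEGER PRIMARY KEY, Region TEXT);
CREATE TABLE Product (Product_key INTEGER PRIMARY KEY, Brand TEXT);
CREATE TABLE Sales   (Store_key INTEGER, Product_key INTEGER);
INSERT INTO Store   VALUES (1, 'East'), (2, 'West'), (3, 'East');
INSERT INTO Product VALUES (10, 'Wolf'), (20, 'Other');
INSERT INTO Sales   VALUES (1, 10), (2, 10), (3, 20), (1, 20);
""")

QUERY = """
SELECT Store.Store_key
FROM Store
JOIN Sales   ON Sales.Store_key = Store.Store_key
JOIN Product ON Product.Product_key = Sales.Product_key
WHERE Store.Region = 'East' AND Product.Brand = 'Wolf'
ORDER BY Store.Store_key;
"""

store_keys = [row[0] for row in conn.execute(QUERY)]
print(store_keys)  # only store 1 is in the East and sold a Wolf product
```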

Assume every stored tuple is 200 bits (25 bytes) long. Furthermore, estimated cardinalities of certain intermediate results are:

  • The number of Wolf-brand products is 10.
  • The number of East-region stores is 100,000.
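From the tuple counts and the 200-bit tuple size stated above, the size of each full table follows directly. The short calculation below is only that arithmetic; the variable names are illustrative. It gives a feel for how much data would have to cross the network if a whole table were shipped to another site.

```python
TUPLE_BITS = 200  # every stored tuple is 200 bits (25 bytes)

# Tuple counts and sites as given in the example.
TUPLES = {"Store": 10_000, "Product": 100_000, "Sales": 1_000_000}
SITE = {"Store": "A", "Product": "B", "Sales": "A"}

size_bits = {table: count * TUPLE_BITS for table, count in TUPLES.items()}

for table, bits in size_bits.items():
    print(f"{table} (site {SITE[table]}): {bits:,} bits = {bits // 8:,} bytes")
```

Sales is by far the largest table at 200,000,000 bits, so a plan that avoids moving it between sites is likely to be cheaper than one that ships it.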

Communication assumptions are:

  • Data rate
...
