...

/

Cost Benefit Analysis of Centralised and Distributed Database

Cost Benefit Analysis of Centralised and Distributed Database

Learn when to use horizontal sharding instead of vertical sharding and vice versa.

Which database sharding approach is best for your organization?

For your organization and business to grow, prevent downtime, and reduce latency, you must scale your resources accordingly. You can scale these resources through a combination of adjustments to network bandwidth, CPU and physical memory requirements, and hard disk adjustments. Both horizontal sharding and vertical sharding involve adding resources to your computing infrastructure; your business stakeholders must decide which is right for your organization.

The following sections explains the pros and cons of no-sharding vs sharding.

Pros and cons of centralized database (No-Sharding)

Advantages

  • Data maintenance such as updating and taking backups of a centralized database.

  • Centralized database has more data consistency than a distributed database.

  • Is more efficient for businesses having a small amount of data to store.

Disadvantages

  • Centralized database takes time when data is retrieved because the number of users accessing the centralized database is higher than the distributed database.

  • Database is inaccessible in the event of failures.

Advantages and disadvantages of distributed database

Advantages

  • Accessing data in a distributed database is easier and fast as data is retrieved from the nearest database shard or the one frequently used.

  • Data not required by local applications is not available to unauthorized users.

  • Provides materialized view for frequent data access.

  • Intensive transactions consisting of queries can be divided into subqueries which can be processed in a parallel fashion.

Disadvantages

  • Sometimes data is required from multiple sites, which takes more time than expected.

  • As relations are divided (vertically/horizontally) among different nodes; therefore, expensive operations such as joins and unions are needed to reconstruct complete relations.

  • Maintaining consistency of data across sites is difficult and requires extra measures.

  • Updation and backups in distributed database take time to synchronize data

Query optimization and processing speed in a distributed database

A transaction in the distributed database depends on the type of query, number of sites (shards) involved, communication speed, and other factors such as underline hardware and the type of database used. However, as an example we would assume a query accessing three tables Store, Product, and Sales.

The number of attributes in each table is given in the following figure.

Assume the distribution of both tables on different sites as

  • The Store table has 10,000 tuples stored at site A
  • The Product table has 100,000 tuples stored at site B
  • The Sales has 1 Million tuples store at site A.

Now assume that the following query needs to be processed.

Select Store_key from (Store JOIN Sales JOIN Product)
where Region= 'East' AND Brand='Wolf'

Assume every stored tuple is 25 bytes long. Furthermore, estimated cardinalities of certain intermediate results are:

  • Number of Wolf brand =10
  • Number of East region is 100,000

Communication assumptions are:

  • Data rate == 50,000 bit per second
  • Access delay = 0.1 second

Now let’s consider the following three strategies for processing the above query.

Let

a=a= total access delay

b=b= ...

Create a free account to access the full course.

By signing up, you agree to Educative's Terms of Service and Privacy Policy